Visual Question Answering (VQA)
763 papers with code • 62 benchmarks • 112 datasets
Visual Question Answering (VQA) is a computer vision task in which a model answers natural-language questions about an image. Solving it requires jointly understanding the image's visual content and the question's text.
Image Source: visualqa.org
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
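As a minimal sketch (not the MCB implementation itself), the simple pooling baselines the abstract mentions can be written for toy feature vectors with hypothetical values:

```python
import numpy as np

# Toy visual and textual feature vectors of equal dimension (hypothetical values).
v = np.array([0.2, 0.5, 0.1])  # visual representation
t = np.array([0.4, 0.3, 0.9])  # textual representation

fused_product = v * t                  # element-wise product
fused_sum = v + t                      # element-wise sum
fused_concat = np.concatenate([v, t])  # concatenation (doubles the dimension)

print(fused_product.shape, fused_sum.shape, fused_concat.shape)
```

The paper's compact bilinear pooling goes beyond these baselines by approximating the full outer product of the two vectors, which these simple fusions cannot capture.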
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.
Compositional Attention Networks for Machine Reasoning
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
Hierarchical Question-Image Co-Attention for Visual Question Answering
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
Pythia v0.1: the Winning Entry to the VQA Challenge 2018
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
Hadamard Product for Low-rank Bilinear Pooling
Bilinear models provide rich representations compared with linear models.
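A minimal sketch of the low-rank bilinear idea, assuming tanh activations and random toy projection matrices (all dimensions here are hypothetical): both inputs are projected into a shared space, fused with a Hadamard (element-wise) product, and projected down, avoiding the full bilinear weight tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

d_x, d_y, d_joint, d_out = 4, 5, 8, 3   # toy dimensions (hypothetical)
x = rng.standard_normal(d_x)            # e.g. image feature
y = rng.standard_normal(d_y)            # e.g. question feature

# Learnable projections in the real model; random here for illustration.
U = rng.standard_normal((d_x, d_joint))
V = rng.standard_normal((d_y, d_joint))
P = rng.standard_normal((d_joint, d_out))

# Hadamard product of the projected inputs replaces the full bilinear form.
fused = np.tanh(U.T @ x) * np.tanh(V.T @ y)
f = P.T @ fused

print(f.shape)
```

A full bilinear model would need a d_x × d_y × d_out weight tensor; the low-rank factorization reduces this to three small matrices.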
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.