Visual Question Answering
680 papers with code • 20 benchmarks • 21 datasets
Libraries
Use these libraries to find Visual Question Answering models and implementations.
Most implemented papers
Hierarchical Question-Image Co-Attention for Visual Question Answering
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
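A minimal PyTorch sketch of the phrase-level stage this describes: 1-D convolutions with unigram/bigram/trigram windows slide over word embeddings, and a max across the three scales gives each position's phrase feature. Dimensions and layer names are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PhraseLevelFeatures(nn.Module):
    """Sketch: n-gram 1-D convolutions over word embeddings, max-pooled
    across scales, as in hierarchical phrase-level question encoding."""
    def __init__(self, emb_dim=512):
        super().__init__()
        # padding keeps the output aligned with the input sequence length
        self.unigram = nn.Conv1d(emb_dim, emb_dim, kernel_size=1)
        self.bigram  = nn.Conv1d(emb_dim, emb_dim, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)

    def forward(self, words):            # words: (batch, seq_len, emb_dim)
        x = words.transpose(1, 2)        # Conv1d expects (batch, channels, seq)
        uni = self.unigram(x)
        bi  = self.bigram(x)[:, :, :-1]  # even kernel adds one extra step; trim it
        tri = self.trigram(x)
        # element-wise max across n-gram scales yields the phrase features
        phrase = torch.max(torch.stack([uni, bi, tri]), dim=0).values
        return torch.tanh(phrase).transpose(1, 2)
```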
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
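A hedged sketch of one cross-modality layer of the kind LXMERT stacks on top of its single-modality encoders: each stream first attends to the other, then to itself. Residual connections, layer norms, and feed-forward blocks are omitted for brevity, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Sketch of a cross-modality encoder layer: cross-attention in both
    directions followed by per-stream self-attention."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vis_cross  = nn.MultiheadAttention(d, heads, batch_first=True)
        self.lang_self  = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vis_self   = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, lang, vis):
        # cross-attention: queries from one modality, keys/values from the other
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _  = self.vis_cross(vis, lang, lang)
        lang2, _ = self.lang_self(lang2, lang2, lang2)
        vis2, _  = self.vis_self(vis2, vis2, vis2)
        return lang2, vis2
```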
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
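For concreteness, one machine-generated multimodal instruction-following record might look like the sketch below; the field names and file name are illustrative, not the dataset's exact schema.

```python
# Hypothetical example of a visual instruction-tuning sample: an image
# reference paired with a human instruction and a model-written response.
sample = {
    "image": "COCO_train2014_000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo holding?"},
        {"from": "gpt",   "value": "The person is holding a red umbrella."},
    ],
}
```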
Hadamard Product for Low-rank Bilinear Pooling
Bilinear models provide rich representations compared with linear models.
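The key idea is to approximate a full bilinear interaction with two low-rank projections fused by a Hadamard (element-wise) product. A minimal PyTorch sketch, with illustrative layer names and tanh non-linearities as in the low-rank bilinear formulation:

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Sketch: project both inputs to a shared latent space and fuse them
    with an element-wise product, avoiding the full x_dim * y_dim * out_dim
    bilinear weight tensor."""
    def __init__(self, x_dim, y_dim, latent_dim, out_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, latent_dim, bias=False)
        self.V = nn.Linear(y_dim, latent_dim, bias=False)
        self.P = nn.Linear(latent_dim, out_dim, bias=False)

    def forward(self, x, y):
        # f = P(tanh(Ux) ∘ tanh(Vy)) -- the Hadamard product does the fusion
        return self.P(torch.tanh(self.U(x)) * torch.tanh(self.V(y)))
```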
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.
Bilinear Attention Networks
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to seamlessly utilize the given vision-language information.
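A sketch of a single bilinear attention glimpse under stated assumptions (dimensions and names are illustrative): a low-rank bilinear form scores every (image region, question word) pair, and the softmax-normalized map weights a joint bilinear feature.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Sketch of one bilinear attention glimpse over region/word pairs."""
    def __init__(self, v_dim, q_dim, h_dim):
        super().__init__()
        self.Uv = nn.Linear(v_dim, h_dim)
        self.Uq = nn.Linear(q_dim, h_dim)
        self.p  = nn.Linear(h_dim, 1)

    def forward(self, v, q):             # v: (B, R, v_dim), q: (B, T, q_dim)
        hv = torch.tanh(self.Uv(v))      # (B, R, h)
        hq = torch.tanh(self.Uq(q))      # (B, T, h)
        # Hadamard product of every region/word pair -> attention logits
        joint = hv.unsqueeze(2) * hq.unsqueeze(1)        # (B, R, T, h)
        att = torch.softmax(self.p(joint).squeeze(-1).flatten(1), dim=1)
        att = att.view(v.size(0), v.size(1), q.size(1))  # (B, R, T)
        # bilinear pooling under the attention map, channel-wise in h
        f = torch.einsum('brh,brt,bth->bh', hv, att, hq)
        return f, att
```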
Simple Baseline for Visual Question Answering
We describe a very simple bag-of-words baseline for visual question answering.
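The baseline concatenates a bag-of-words question vector with a precomputed CNN image feature and applies a single softmax classifier over candidate answers. A minimal sketch, with assumed dimensions (e.g. a 4096-d image feature):

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Sketch of a bag-of-words + image-feature VQA baseline: one linear
    classifier over the concatenated question and image representations."""
    def __init__(self, vocab_size, img_dim=4096, n_answers=1000, emb_dim=300):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab_size, emb_dim, mode='sum')  # BoW sum
        self.classifier = nn.Linear(emb_dim + img_dim, n_answers)

    def forward(self, question_ids, img_feat):
        # question_ids: (B, T) word indices; img_feat: (B, img_dim)
        q = self.word_emb(question_ids)
        return self.classifier(torch.cat([q, img_feat], dim=1))
```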
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter!
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.