Visual Question Answering
680 papers with code • 20 benchmarks • 21 datasets
Libraries
Use these libraries to find Visual Question Answering models and implementations.
Most implemented papers
Hierarchical Question-Image Co-Attention for Visual Question Answering
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
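A minimal PyTorch sketch of the phrase-level stage this describes: 1-D convolutions with unigram/bigram/trigram windows slide over word embeddings, and a max across the three scales gives each position's phrase feature. Dimensions and layer names are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PhraseLevelFeatures(nn.Module):
    """Sketch: n-gram 1-D convolutions over word embeddings, max-pooled
    across scales, as in hierarchical phrase-level question encoding."""
    def __init__(self, emb_dim=512):
        super().__init__()
        # padding keeps the output aligned with the input sequence length
        self.unigram = nn.Conv1d(emb_dim, emb_dim, kernel_size=1)
        self.bigram  = nn.Conv1d(emb_dim, emb_dim, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)

    def forward(self, words):            # words: (batch, seq_len, emb_dim)
        x = words.transpose(1, 2)        # Conv1d expects (batch, channels, seq)
        uni = self.unigram(x)
        bi  = self.bigram(x)[:, :, :-1]  # even kernel adds one extra step; trim it
        tri = self.trigram(x)
        # element-wise max across n-gram scales yields the phrase features
        phrase = torch.max(torch.stack([uni, bi, tri]), dim=0).values
        return torch.tanh(phrase).transpose(1, 2)
```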
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
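A hedged sketch of one cross-modality layer of the kind LXMERT stacks on top of its single-modality encoders: each stream first attends to the other, then to itself. Residual connections, layer norms, and feed-forward blocks are omitted for brevity, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Sketch of a cross-modality encoder layer: cross-attention in both
    directions followed by per-stream self-attention."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vis_cross  = nn.MultiheadAttention(d, heads, batch_first=True)
        self.lang_self  = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vis_self   = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, lang, vis):
        # cross-attention: queries from one modality, keys/values from the other
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _  = self.vis_cross(vis, lang, lang)
        lang2, _ = self.lang_self(lang2, lang2, lang2)
        vis2, _  = self.vis_self(vis2, vis2, vis2)
        return lang2, vis2
```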
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
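For concreteness, one machine-generated multimodal instruction-following record might look like the sketch below; the field names and file name are illustrative, not the dataset's exact schema.

```python
# Hypothetical example of a visual instruction-tuning sample: an image
# reference paired with a human instruction and a model-written response.
sample = {
    "image": "COCO_train2014_000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo holding?"},
        {"from": "gpt",   "value": "The person is holding a red umbrella."},
    ],
}
```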
Hadamard Product for Low-rank Bilinear Pooling
Bilinear models provide rich representations compared with linear models.
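The key idea is to approximate a full bilinear interaction with two low-rank projections fused by a Hadamard (element-wise) product. A minimal PyTorch sketch, with illustrative layer names and tanh non-linearities as in the low-rank bilinear formulation:

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Sketch: project both inputs to a shared latent space and fuse them
    with an element-wise product, avoiding the full x_dim * y_dim * out_dim
    bilinear weight tensor."""
    def __init__(self, x_dim, y_dim, latent_dim, out_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, latent_dim, bias=False)
        self.V = nn.Linear(y_dim, latent_dim, bias=False)
        self.P = nn.Linear(latent_dim, out_dim, bias=False)

    def forward(self, x, y):
        # f = P(tanh(Ux) ∘ tanh(Vy)) -- the Hadamard product does the fusion
        return self.P(torch.tanh(self.U(x)) * torch.tanh(self.V(y)))
```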
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.
Bilinear Attention Networks
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to seamlessly utilize the given vision-language information.
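A sketch of a single bilinear attention glimpse under stated assumptions (dimensions and names are illustrative): a low-rank bilinear form scores every (image region, question word) pair, and the softmax-normalized map weights a joint bilinear feature.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Sketch of one bilinear attention glimpse over region/word pairs."""
    def __init__(self, v_dim, q_dim, h_dim):
        super().__init__()
        self.Uv = nn.Linear(v_dim, h_dim)
        self.Uq = nn.Linear(q_dim, h_dim)
        self.p  = nn.Linear(h_dim, 1)

    def forward(self, v, q):             # v: (B, R, v_dim), q: (B, T, q_dim)
        hv = torch.tanh(self.Uv(v))      # (B, R, h)
        hq = torch.tanh(self.Uq(q))      # (B, T, h)
        # Hadamard product of every region/word pair -> attention logits
        joint = hv.unsqueeze(2) * hq.unsqueeze(1)        # (B, R, T, h)
        att = torch.softmax(self.p(joint).squeeze(-1).flatten(1), dim=1)
        att = att.view(v.size(0), v.size(1), q.size(1))  # (B, R, T)
        # bilinear pooling under the attention map, channel-wise in h
        f = torch.einsum('brh,brt,bth->bh', hv, att, hq)
        return f, att
```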
Simple Baseline for Visual Question Answering
We describe a very simple bag-of-words baseline for visual question answering.
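The baseline concatenates a bag-of-words question vector with a precomputed CNN image feature and applies a single softmax classifier over candidate answers. A minimal sketch, with assumed dimensions (e.g. a 4096-d image feature):

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Sketch of a bag-of-words + image-feature VQA baseline: one linear
    classifier over the concatenated question and image representations."""
    def __init__(self, vocab_size, img_dim=4096, n_answers=1000, emb_dim=300):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab_size, emb_dim, mode='sum')  # BoW sum
        self.classifier = nn.Linear(emb_dim + img_dim, n_answers)

    def forward(self, question_ids, img_feat):
        # question_ids: (B, T) word indices; img_feat: (B, img_dim)
        q = self.word_emb(question_ids)
        return self.classifier(torch.cat([q, img_feat], dim=1))
```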
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter!
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.