Visual Question Answering (VQA)
764 papers with code • 62 benchmarks • 112 datasets
Visual Question Answering (VQA) is a computer-vision task in which a system answers natural-language questions about an image. The goal is to teach machines to understand an image's content well enough to answer arbitrary questions about it in natural language.
(Image source: visualqa.org)
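As a concrete illustration of the task, here is a minimal inference sketch using the Hugging Face transformers VQA pipeline. The checkpoint name and example image URL are illustrative assumptions, not tied to any paper listed below.

```python
# Minimal VQA inference sketch with the Hugging Face `transformers` pipeline.
# The checkpoint and image URL below are assumptions for illustration only.
from transformers import pipeline
from PIL import Image
import requests

# Load a ViLT model fine-tuned for VQA (assumed checkpoint).
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Fetch an example image (any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Ask a natural-language question about the image.
preds = vqa(image=image, question="How many cats are in the picture?")
print(preds[0]["answer"], preds[0]["score"])  # top answer with confidence
```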
Latest papers
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Multi-modal Auto-regressive Modeling via Visual Words
Large Language Models (LLMs), which benefit from auto-regressive modeling over massive corpora of unannotated text, demonstrate powerful perceptual and reasoning capabilities.
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering (VQA).
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets).
Uncertainty-Aware Evaluation for Vision-Language Models
Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs.
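One common way to quantify such uncertainty (a generic sketch, not the specific metric used in the paper above) is the Shannon entropy of the model's predictive distribution over candidate answers:

```python
# Sketch: predictive entropy as an uncertainty score for a VQA/VLM output.
# This is a generic illustration, not the exact protocol of the paper above.
import numpy as np

def predictive_entropy(answer_probs: np.ndarray) -> float:
    """Shannon entropy (in nats) of a distribution over candidate answers.
    Higher entropy means the model is less certain about its answer."""
    p = answer_probs / answer_probs.sum()  # normalize defensively
    p = p[p > 0]                           # avoid log(0)
    return float(-(p * np.log(p)).sum())

# Example: a confident vs. an uncertain answer distribution.
print(predictive_entropy(np.array([0.90, 0.05, 0.05])))  # low entropy
print(predictive_entropy(np.array([0.40, 0.30, 0.30])))  # high entropy
```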
CommVQA: Situating Visual Question Answering in Communicative Contexts
Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation.
CoLLaVO: Crayon Large Language and Vision mOdel
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision-language (VL) tasks.
Multi-modal preference alignment remedies regression of visual instruction tuning on language model
We propose a distillation-based multi-modal alignment method, using fine-grained annotations on a small dataset, that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.