Visual Question Answering (VQA)

758 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Image Source: visualqa.org

Benchmarks

Add a Result

These leaderboards are used to track progress in Visual Question Answering (VQA)

Dataset	Best Model	Compare
VQA v2 test-dev	PaLI	See all
VQA v2 test-std	BEiT-3	See all
OK-VQA	PaLI-X-VPD	See all
MSVD-QA	VLAB	See all
DocVQA test	Human	See all
MSRVTT-QA	VLAB	See all
InfographicVQA	Gemini Ultra (pixel only)	See all
COCO Visual Question Answering (VQA) real images 1.0 open ended	SAN	See all
CLEVR	NS-VQA (1K programs)	See all
GQA test-dev	CFR	See all
InfiMM-Eval	GPT-4V	See all
A-OKVQA	SMoLA-PaLI-X Specialist Model	See all
IconQA	ViLT	See all
VCR (Q-A) test	GPT4RoI	See all
VQA v2 val	BLIP-2 ViT-G FlanT5 XXL (zero-shot)	See all
COCO Visual Question Answering (VQA) real images 1.0 multiple choice	MCB 7 att.	See all
VQA-CP	CSS	See all
VQA-CE	RandImg	See all
VCR (QA-R) test	GPT4RoI	See all
VCR (Q-AR) test	GPT4RoI	See all
GQA test-std	NSM	See all
VQA v1 test-dev	SAAA (ResNet)	See all
IllusionVQA	GPT4-Vision	See all
VQA v1 test-std	RAU (ResNet)	See all
GQA Test2019	TRRNet (Ensemble)	See all
WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	BLIP2 FlanT5-XXL (Fine-tuned)	See all
InfoSeek	RA-VQAv2 w/ PreFLMR	See all
CLEVR-Humans	MDETR	See all
QLEVR	MAC	See all
COCO Visual Question Answering (VQA) abstract images 1.0 open ended	Graph VQA	See all
COCO Visual Question Answering (VQA) abstract 1.0 multiple choice	Graph VQA	See all
Visual7W	CMN	See all
COCO Visual Question Answering (VQA) real images 2.0 open ended	HDU-USYD-UNCC	See all
PMC-VQA	MedVInT	See all
AI2D	SMoLA-PaLI-X Specialist Model	See all
VCR (Q-A) dev	VL-BERTLARGE	See all
VCR (QA-R) dev	VL-BERTLARGE	See all
VCR (Q-AR) dev	VL-BERTLARGE	See all
VizWiz 2018	Colin	See all
VizWiz 2020 VQA	PaLI	See all
PlotQA-D1	MatCha	See all
FigureQA - test 1	PReFIL	See all
PlotQA-D2	MatCha	See all
F-VQA	ZS-F-VQA	See all
HallusionBench	GPT-4V	See all
TDIUC	Accuracy	See all
VizWiz 2020 Answerability	CLIP-Ensemble	See all
TextVQA test-standard	TAP	See all
GQA	RelViT	See all
GRIT	Unified-IOXL	See all
TGIF-QA	HiTeA	See all
VQA-X	OFA-X-MT	See all
Visual Genome (pairs)	CMN	See all
Visual Genome (subjects)	CMN	See all
DocVQA val	BERT LARGE Baseline	See all
ZS-F-VQA	SAN † - hard mask	See all
WebSRC	DUBLIN	See all
DeepForm	DUBLIN	See all
SciGraphQA	SciGraphQA-baseline	See all
DVQA test-familiar	PReFIL (Oracle OCR)	See all
CORE-MM	GPT-4V	See all
VizWiz 2018 Answerability	ensemble_two_best	See all

Show all 62 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Visual Question Answering (VQA) models and implementations

huggingface/transformers

13 papers

124,527

salesforce/lavis

7 papers

8,685

ZephyrZhuQi/ssbaseline

5 papers

gabegrand/adversarial-vqa

5 papers

See all 15 libraries.

Datasets

Subtasks

Embodied Question Answering

3D Question Answering (3D-QA)

Generative Visual Question Answering

Factual Visual Question Answering

Latest papers with no code

Most implemented Social Latest No code

Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

no code yet • 27 Mar 2024

Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis.

Paper
Add Code

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

no code yet • 26 Mar 2024

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI.

Paper
Add Code

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

no code yet • 25 Mar 2024

In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which needs strong reasoning.

Paper
Add Code

AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

no code yet • 20 Mar 2024

Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation.

Paper
Add Code

Multi-Modal Hallucination Control by Visual Information Grounding

no code yet • 20 Mar 2024

In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations.

Paper
Add Code

WoLF: Wide-scope Large Language Model Framework for CXR Understanding

no code yet • 19 Mar 2024

(1) Previous methods solely use CXR reports, which are insufficient for comprehensive Visual Question Answering (VQA), especially when additional health-related data like medication history and prior diagnoses are needed.

Paper
Add Code

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

no code yet • 18 Mar 2024

The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.

Paper
Add Code

Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

no code yet • 17 Mar 2024

Two approaches have emerged to input images into large language models (LLMs).

Paper
Add Code

Mitigating Dialogue Hallucination for Large Multi-modal Models via Adversarial Instruction Tuning

no code yet • 15 Mar 2024

To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LMMs.

Paper
Add Code

Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

no code yet • 15 Mar 2024

By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels.

Paper
Add Code

Visual Question Answering (VQA)

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Latest papers with no code

Content

Benchmarks

Add a Result