Visual Question Answering v2.0 (VQA v2.0)

Introduced by Goyal et al. in "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.

265,016 images (COCO and abstract scenes)
At least 3 questions (5.4 questions on average) per image
10 ground truth answers per question
3 plausible (but likely incorrect) answers per question
Automatic evaluation metric
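
The automatic evaluation metric listed above is usually described as giving an answer full credit when at least 3 of the 10 annotators provided it, i.e. min(number of matching human answers / 3, 1), averaged over every subset of 9 annotators. The Python sketch below illustrates that rule under stated assumptions: the function name vqa_accuracy is hypothetical, and the only normalization applied is lowercasing and whitespace stripping, whereas the official evaluation script also normalizes punctuation, articles and number words.

```python
def vqa_accuracy(prediction, human_answers):
    """Sketch of the VQA accuracy rule for a single question.

    For each way of leaving one annotator out, score the prediction as
    min(matches / 3, 1), then average the scores over all leave-outs.
    Normalization here is simplified (lowercase + strip) and is an
    assumption; it is not the official answer-processing pipeline.
    """
    if not human_answers:
        raise ValueError("human_answers must contain at least one answer")

    pred = prediction.strip().lower()
    answers = [a.strip().lower() for a in human_answers]

    scores = []
    for i in range(len(answers)):
        # Compare against every annotator except the i-th one.
        others = answers[:i] + answers[i + 1:]
        matches = sum(1 for a in others if a == pred)
        scores.append(min(matches / 3.0, 1.0))

    return sum(scores) / len(scores)
```

With 10 ground truth answers of which 2 are "yes" and 8 are "no", this sketch scores a predicted "yes" at about 0.6 and a predicted "no" at 1.0. A dataset-level score would be the mean of these per-question values.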

The first version of the dataset was released in October 2015.

Benchmarks

Task	Dataset Variant	Best Model
Visual Question Answering (VQA)	VQA v2 test-dev	PaLI
Visual Question Answering (VQA)	VQA v2 test-std	BEiT-3
Visual Question Answering (VQA)	VQA v2 val	BLIP-2 ViT-G FlanT5 XXL
Visual Question Answering	VQA v2 test-dev	BLIP-2 ViT-G OPT 6.7B
Visual Question Answering	VQA v2 val	BLIP-2 ViT-G OPT 6.7B
Visual Question Answering	VQA v2 test-std	LXMERT
Visual Question Answering	VQA v2	Emu-I *