Visual Question Answering v2.0 (VQA v2.0)

Introduced by Goyal et al. in "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.

  • 265,016 images (COCO and abstract scenes)
  • At least 3 questions (5.4 questions on average) per image
  • 10 ground truth answers per question
  • 3 plausible (but likely incorrect) answers per question
  • Automatic evaluation metric
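
The automatic metric is not spelled out above. As a minimal sketch, the commonly cited VQA accuracy rule scores a predicted answer as min(number of annotators who gave that answer / 3, 1), computed against the 10 ground truth answers. The function name `vqa_accuracy` and the simple lowercase/strip normalization are assumptions made for illustration; the official evaluation code is understood to apply further answer normalization and to average over subsets of annotators, which this sketch omits.

```python
from collections import Counter
from typing import Sequence


def vqa_accuracy(predicted: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: min(#annotators who gave this answer / 3, 1).

    Hypothetical helper for illustration; normalization here is only
    lowercasing and whitespace stripping.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


# Ten ground truth answers for one question (made-up example data).
answers = ["yes"] * 2 + ["no"] * 8

print(vqa_accuracy("no", answers))     # 8 matches, capped at full credit
print(vqa_accuracy("yes", answers))    # 2 matches, partial credit
print(vqa_accuracy("maybe", answers))  # no matches, zero credit
```

Under this rule an answer earns full credit once at least three annotators agree with it, so a single question can have more than one fully correct answer.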

The first version of the dataset was released in October 2015.



