56 dataset results for Visual Question Answering (VQA) AND English

CLEVR-Ref+

CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.

16 PAPERS • 2 BENCHMARKS

DVQA (Data Visualizations via Question Answering)

DVQA is a synthetic question-answering dataset on images of bar charts.

31 PAPERS • 1 BENCHMARK

FM-IQA (Freestyle Multilingual Image Question Answering)

FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.

10 PAPERS • NO BENCHMARKS YET

GuessWhat?!

GuessWhat?! is a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images.

73 PAPERS • NO BENCHMARKS YET

KnowIT VQA

KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual, and temporal coherence reasoning with knowledge-based questions, which can only be answered with the experience gained from watching the series.

8 PAPERS • 1 BENCHMARK

OK-VQA (Outside Knowledge Visual Question Answering)

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.

258 PAPERS • 2 BENCHMARKS

ST-VQA (Scene Text Visual Question Answering)

ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.

74 PAPERS • NO BENCHMARKS YET

simply-CLEVR

The simply-CLEVR dataset aims to provide a benchmark for transparent quantitative evaluation of explanation methods (aka heatmaps/XAI methods). It consists of simple Visual Question Answering (VQA) questions derived from the original CLEVR task, and each question is accompanied by two Ground Truth Masks that serve as a basis for evaluating explanations on the input image.

1 PAPER • NO BENCHMARKS YET
