Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, an average of 17 questions per image. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why and How. The Visual Genome dataset also provides 108K images with densely annotated objects, attributes and relationships.
1,147 PAPERS • 19 BENCHMARKS
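As a quick illustration of that question-type balance, the sketch below tallies the six leading question words over the QA annotations. It assumes a local question_answers.json whose records each carry a "qas" list of question/answer dicts; the file name and exact schema are assumptions, not part of the listing above.

```python
import json
from collections import Counter

# The six question types reported for Visual Genome QA.
QA_TYPES = ("what", "where", "when", "who", "why", "how")

def question_type(question: str) -> str:
    """Return the question's leading type keyword, or 'other'."""
    words = question.strip().lower().split()
    return words[0] if words and words[0] in QA_TYPES else "other"

# Assumed file name and schema: a list of per-image records,
# each with a "qas" list of {"question": ..., "answer": ...} dicts.
with open("question_answers.json") as f:
    records = json.load(f)

counts = Counter(
    question_type(qa["question"])
    for record in records
    for qa in record.get("qas", [])
)
print(counts.most_common())
```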
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences per image provided by human annotators.
739 PAPERS • 9 BENCHMARKS
The Flickr30K Entities dataset is an extension to the Flickr30K dataset. It augments the original 158k captions with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. This is used to define a new benchmark for localization of textual entity mentions in an image.
132 PAPERS • 2 BENCHMARKS
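One way to see how the coreference chains tie mentions together is to parse the bracketed entity tags found in the distributed sentence files. The sketch below assumes the commonly seen [/EN#chain_id/type phrase] syntax; treat that format as an assumption and verify it against your local copy of the annotations.

```python
import re

# Assumed mention syntax: [/EN#<chain_id>/<type1[/type2...]> <phrase>]
MENTION = re.compile(r"\[/EN#(\d+)(/[\w/]+)\s+([^\]]+)\]")

def parse_mentions(caption: str):
    """Return (chain_id, entity_types, phrase) triples for one caption."""
    return [
        (int(chain_id), types.strip("/").split("/"), phrase)
        for chain_id, types, phrase in MENTION.findall(caption)
    ]

# Mentions sharing a chain id across an image's five captions
# belong to the same coreference chain.
caption = "[/EN#283585/people A man] plays [/EN#283586/other a guitar] ."
for chain_id, types, phrase in parse_mentions(caption):
    print(chain_id, types, phrase)
```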
The MS-CXR dataset provides 1,162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. The dataset complements the existing MIMIC-CXR v.2 dataset and comprises: (1) reviewed and edited bounding boxes and phrases (1,026 bounding box/sentence pairs); and (2) manual bounding box labels annotated from scratch (136 bounding box/sentence pairs).
22 PAPERS • NO BENCHMARKS YET
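To sanity-check the roughly equal per-finding split described above, one could group the box/sentence pairs by finding. This minimal sketch assumes the annotations sit locally in a CSV with a column naming the finding; the file name and column name used here are assumptions, not the released schema.

```python
import csv
from collections import Counter

# Assumed local file and column names; adjust to the released schema.
with open("ms_cxr_annotations.csv", newline="") as f:
    rows = list(csv.DictReader(f))

by_finding = Counter(row["category_name"] for row in rows)
print(f"{len(rows)} box/sentence pairs across {len(by_finding)} findings")
for finding, n in by_finding.most_common():
    print(f"  {finding}: {n}")
```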
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities across four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and manipulation.
2 PAPERS • NO BENCHMARKS YET
VD-Ref is a dataset with ground-truth mappings from both noun phrases and pronouns to image regions. It contains 10k complete dialogue sets from the VisDialog dataset, with sentences tokenized using the StanfordCoreNLP tool to prepare them for the subsequent human annotation.
1 PAPER • NO BENCHMARKS YET
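The entry mentions StanfordCoreNLP tokenization; a rough way to reproduce that step in Python is via stanza, the Stanford NLP Group's Python library, used here as a stand-in for the Java tool (an assumption, not the VD-Ref authors' exact pipeline).

```python
import stanza

# One-time model download, then a tokenize-only pipeline.
# stanza is a stand-in here for the Java StanfordCoreNLP tool.
stanza.download("en", processors="tokenize")
nlp = stanza.Pipeline(lang="en", processors="tokenize")

doc = nlp("The man on the left is holding a red umbrella.")
tokens = [token.text for sent in doc.sentences for token in sent.tokens]
print(tokens)
```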