Visual Genome contains Visual Question Answering data in a multiple-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, an average of 17 questions per image. Compared to the Visual Question Answering dataset, Visual Genome has a more balanced distribution over 6 question types: What, Where, When, Who, Why, and How. The Visual Genome dataset also provides 108K images densely annotated with objects, attributes, and relationships.
1,151 PAPERS • 19 BENCHMARKS
LayoutBench is a diagnostic benchmark that examines 4 spatial control skills (number, position, size, shape). Each skill consists of 2 out-of-distribution (OOD) layout splits, giving 8 tasks in total (4 skills × 2 splits). To disentangle spatial control from other aspects of image generation, such as generating diverse objects, LayoutBench keeps the object configurations of CLEVR and changes the spatial layouts.
1 PAPER • 1 BENCHMARK
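The LayoutBench task count described above (4 skills × 2 splits = 8 tasks) can be sketched as a simple enumeration. This is only an illustration: the skill names come from the description, but the split names `ood_split_1` and `ood_split_2` are hypothetical placeholders, since the actual split names are not given here and may differ per skill.

```python
from itertools import product

# The four spatial control skills named in the description.
SKILLS = ["number", "position", "size", "shape"]

# Hypothetical placeholder names: the description only says each skill
# has 2 OOD layout splits, not what they are called.
SPLITS = ["ood_split_1", "ood_split_2"]


def enumerate_tasks(skills=SKILLS, splits=SPLITS):
    """Return one task identifier per (skill, split) pair."""
    return [f"{skill}/{split}" for skill, split in product(skills, splits)]


tasks = enumerate_tasks()
print(len(tasks))  # number of (skill, split) combinations
for task in tasks:
    print(task)
```

Iterating over such a list is one way an evaluation script might report a score per task rather than a single aggregate.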