5 dataset results for Retrieval AND Images AND English

OK-VQA (Outside Knowledge Visual Question Answering)

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.

259 PAPERS • 2 BENCHMARKS

ARO

Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.

22 PAPERS • NO BENCHMARKS YET

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.

17 PAPERS • 2 BENCHMARKS

DIOR-RSVG

DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.

7 PAPERS • NO BENCHMARKS YET

ShapeTalk (The ShapeTalk Dataset)

ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes, across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts, and a total of 536,596 utterances

2 PAPERS • NO BENCHMARKS YET

Datasets

5 dataset results for Retrieval AND Images AND English