The RefCOCO dataset is a referring expression generation (REG) dataset used for tasks related to understanding natural language expressions that refer to specific objects in images.
302 PAPERS • 19 BENCHMARKS
CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
16 PAPERS • 2 BENCHMARKS
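As a rough illustration of how a functional program can yield intermediate reasoning ground truth, the sketch below runs a chain of attribute filters over a toy scene and records the surviving objects after each step. The scene, the `filter_attr` and `run_program` helpers, and the program encoding are all hypothetical; they are not the CLEVR-Ref+ format or its actual module set.

```python
from typing import Dict, List, Tuple

# Hypothetical synthetic scene: every object's attributes are known exactly.
SCENE: List[Dict] = [
    {"id": 0, "color": "red", "shape": "cube"},
    {"id": 1, "color": "blue", "shape": "sphere"},
    {"id": 2, "color": "red", "shape": "sphere"},
]


def filter_attr(objs: List[Dict], key: str, value: str) -> List[Dict]:
    """Keep only the objects whose attribute `key` equals `value`."""
    return [o for o in objs if o[key] == value]


def run_program(scene: List[Dict], program: List[Tuple[str, str]]) -> List[List[int]]:
    """Apply each (key, value) filter in order.

    Returns the list of surviving object ids after every step, which plays
    the role of intermediate ground truth in this toy setting.
    """
    current = list(scene)
    trace = []
    for key, value in program:
        current = filter_attr(current, key, value)
        trace.append([o["id"] for o in current])
    return trace


# Assumed encoding of the expression "the red sphere".
PROGRAM = [("color", "red"), ("shape", "sphere")]
TRACE = run_program(SCENE, PROGRAM)
print(TRACE)  # [[0, 2], [2]]
```

The last entry of the trace is the set of referred objects; the earlier entries show which objects each reasoning step kept, with no human annotation involved.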
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:
13 PAPERS • 8 BENCHMARKS
Description Detection Dataset ($D^3$, /dikju:b/) is an attempt at creating a next-generation object detection dataset. Unlike traditional detection datasets, the class names of the objects are no longer simple nouns or noun phrases, but rather complex descriptions, such as "a dog not being held by a leash". For each image in the dataset, any object that matches the description is annotated. The dataset provides annotations such as bounding boxes and finely crafted instance masks. It comprises 422 well-designed descriptions and 24,282 positive object-description pairs.
8 PAPERS • 1 BENCHMARK
A Game Of Sorts is a collaborative image ranking task. Players are asked to rank a set of images based on a given sorting criterion. The game provides a framework for the evaluation of visually grounded language understanding and generation of referring expressions in multimodal dialogue settings.
2 PAPERS • NO BENCHMARKS YET
In Visual Query Detection (VQD), a system is given a natural-language query (prompt) and an image, and must produce 0 to N boxes that satisfy that query. VQD is related to several other tasks in computer vision, but it captures abilities these other tasks ignore. Unlike object detection, VQD can deal with attributes and relations among objects in the scene. In VQA, algorithms often produce the right answers due to dataset bias without "looking" at the relevant image regions. Referring Expression Recognition (RER) datasets have short and often ambiguous prompts, and because they expect only a single box as output, they make it easier to exploit dataset biases. VQD requires goal-directed object detection and outputting a variable number of boxes that answer a query.
1 PAPER • 1 BENCHMARK
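Because a VQD system may return anywhere from zero to N boxes, scoring has to handle empty and variable-length outputs. The sketch below shows one plausible way to do that with greedy one-to-one IoU matching. The `(x1, y1, x2, y2)` box format, the 0.5 threshold, and the `iou` and `match_boxes` helpers are assumptions made for illustration; this is not the official VQD evaluation protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(pred: List[Box], gt: List[Box], thresh: float = 0.5) -> Tuple[int, int, int]:
    """Greedily match each prediction to its best unmatched ground-truth box.

    Returns (true_positives, false_positives, false_negatives). Either list
    may be empty, which covers queries that no object satisfies.
    """
    unmatched = list(range(len(gt)))
    tp = 0
    for p in pred:
        best_j, best_iou = None, thresh
        for j in unmatched:
            v = iou(p, gt[j])
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            unmatched.remove(best_j)
            tp += 1
    return tp, len(pred) - tp, len(unmatched)


# Two predictions for a query with a single matching object.
print(match_boxes([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10)]))  # (1, 1, 0)
```

Under this scheme, a prediction made for a query with no valid objects counts as a false positive, and an empty prediction for a query with valid objects counts each missed object as a false negative.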