ReferIt3D provides two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances leveraging spatial relations among fine-grained object classes to localize a referred object in a scene, and ii) Nr3D which contains 41.5K natural, free-form, utterances collected by deploying a 2-player object reference game in 3D scenes. This dataset can be used for 3D visual grounding and 3D dense captioning tasks.
46 PAPERS • NO BENCHMARKS YET
A new large-scale dataset for referring expressions, based on MS-COCO.
29 PAPERS • 4 BENCHMARKS
Spot-the-diff is a dataset consisting of 13,192 image pairs along with corresponding human provided text annotations stating the differences between the two images.
22 PAPERS • NO BENCHMARKS YET
MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos or natural language moment retrieval. MAD exploits available audio descriptions of mainstream movies. Such audio descriptions are redacted for visually impaired audiences and are therefore highly descriptive of the visual content being displayed. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and provides a unique setup for video grounding as the visual stream is truly untrimmed with an average video duration of 110 minutes. 2 orders of magnitude longer than legacy datasets.
19 PAPERS • 2 BENCHMARKS
Given 10 minimally contrastive (highly similar) images and a complex description for one of them, the task is to retrieve the correct image. The source of most images are videos and descriptions as well as retrievals come from human.
8 PAPERS • 1 BENCHMARK
METU-VIREF is a video referring expression dataset comprising of videos from VIRAT Ground and ILSVRC2015 VID datasets. VIRAT is a surveillance dataset and contains mainly people and vehicles. To line up with this and restrict the domain, only videos that contain vehicles from the ILSVRC dataset are used. The METU-VIREF dataset does not contain whole videos from these datasets (the videos need to be downloaded from the respective sources) but just referring expressions for video sequences containing an object pair. For this, object pairs are chosen which had a relation that a meaningful referring expression could be written for.
1 PAPER • NO BENCHMARKS YET