2 dataset results for Video Understanding AND Images AND English

ImageCoDe (Image Retrieval from Contextual Descriptions)

Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most of the images are sourced from videos, and both the descriptions and the retrievals come from humans.

8 PAPERS • 1 BENCHMARK
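
The ImageCoDe retrieval task described above can be illustrated with a minimal sketch, assuming the description and each of the 10 candidate images have already been embedded as vectors by some model. The names `cosine`, `retrieve`, and `accuracy` are hypothetical helpers for illustration and are not part of the dataset's tooling; cosine similarity is one common scoring choice, not necessarily the one used by the benchmark.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(description_vec, image_vecs):
    # Return the index of the candidate image that best matches the description.
    scores = [cosine(description_vec, v) for v in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(predictions, targets):
    # Fraction of examples where the retrieved index equals the target index.
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

In this sketch, each example contributes one predicted index, and retrieval accuracy over the set is the fraction of examples where that index matches the human-annotated target.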

VTC (Videos, Titles and Comments)

VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.

2 PAPERS • NO BENCHMARKS YET
