DialogCC is a large-scale multi-modal dialogue dataset, which covers diverse real-world topics and various images per dialogue. It contains 651k unique images and is designed for image and text retrieval tasks.
1 PAPER • NO BENCHMARKS YET
DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.
7 PAPERS • NO BENCHMARKS YET
The IMAGE-CHAT dataset is a large collection of (image, style trait for speaker A, style trait for speaker B, dialogue between A & B) tuples that we collected using crowd-workers, Each dialogue consists of consecutive turns by speaker A and B. No particular constraints are placed on the kinds of utterance, only that we ask the speakers to both use the provided style trait, and to respond to the given image and dialogue history in an engaging way. The goal is not just to build a diagnostic dataset but a basis for training models that humans actually want to engage with.
28 PAPERS • 2 BENCHMARKS