The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
11,149 PAPERS • 96 BENCHMARKS
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators.
817 PAPERS • 11 BENCHMARKS
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
334 PAPERS • 2 BENCHMARKS
The VizWiz-VQA dataset originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. The proposed challenge addresses the following two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered.
199 PAPERS • 7 BENCHMARKS
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image (see the loading sketch below).
193 PAPERS • 4 BENCHMARKS
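COCO Captions annotations are commonly read with the pycocotools COCO API. The snippet below is a minimal sketch, assuming a local copy of the standard captions annotation file; the file path is an assumption, not part of the dataset description.

from pycocotools.coco import COCO

# Path is an assumption; point it at your local COCO captions annotation file.
coco_caps = COCO("annotations/captions_train2017.json")

# Print the reference captions for the first image in the split.
img_id = coco_caps.getImgIds()[0]
ann_ids = coco_caps.getAnnIds(imgIds=[img_id])
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])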
The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. Images were licensed from Getty Images so that researchers can use the data set to support their work.
162 PAPERS • 2 BENCHMARKS
The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets.
156 PAPERS • 13 BENCHMARKS
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
87 PAPERS • 1 BENCHMARK
The ReferIt dataset contains 130,525 expressions for referring to 96,654 objects in 19,894 images of natural scenes.
79 PAPERS • NO BENCHMARKS YET
Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly -- but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance.
79 PAPERS • 1 BENCHMARK
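For concreteness, the Winoground paper evaluates each (two-caption, two-image) example with text, image, and group scores. The sketch below restates those criteria, where s[i][j] denotes the model's similarity between caption i and image j; it is an illustration of the scoring rule, not the official evaluation code.

def winoground_scores(s):
    """s[i][j]: model similarity score for caption C_i paired with image I_j (2x2)."""
    # Text score: given each image, the matching caption must score higher.
    text = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image score: given each caption, the matching image must score higher.
    image = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group score: both conditions must hold simultaneously.
    return text, image, text and image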
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
59 PAPERS • 5 BENCHMARKS
The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for the remote sensing image captioning task. It contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC and Tianditu. The images are fixed to 224×224 pixels with various resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image.
58 PAPERS • 3 BENCHMARKS
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
48 PAPERS • NO BENCHMARKS YET
A new large-scale dataset for referring expressions, based on MS-COCO.
39 PAPERS • 6 BENCHMARKS
Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant molecule for a natural language description (see the retrieval sketch below).
34 PAPERS • 4 BENCHMARKS
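A generic sketch of such description-to-molecule retrieval, assuming a standard dual-encoder setup in which the text description and candidate molecules are embedded into a shared space; the embeddings and function names here are hypothetical, not the dataset's official pipeline.

import numpy as np

def retrieve_molecules(description_emb: np.ndarray, molecule_embs: np.ndarray) -> np.ndarray:
    """Rank candidate molecules by cosine similarity to a description embedding."""
    text = description_emb / np.linalg.norm(description_emb)
    mols = molecule_embs / np.linalg.norm(molecule_embs, axis=1, keepdims=True)
    scores = mols @ text          # cosine similarity per molecule
    return np.argsort(-scores)    # indices of molecules, best match first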
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,561 images from the Visual Genome dataset, each paired with one paragraph. The training/val/test splits contain 14,575/2,487/2,489 images.
31 PAPERS • 2 BENCHMARKS
UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.
30 PAPERS • 1 BENCHMARK
The SentiCap dataset contains several thousand images with captions expressing positive and negative sentiments. These sentimental captions were constructed by the authors by re-writing factual descriptions. In total there are 2,000+ sentimental captions.
26 PAPERS • NO BENCHMARKS YET
FlickrStyle10K is built on the Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and stylized captions, including humorous and romantic styles. However, only 7,000 pairs from the official training set are now publicly accessible. The dataset can be downloaded via https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip
23 PAPERS • 2 BENCHMARKS
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
21 PAPERS • 3 BENCHMARKS
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and 50,574 images annotated with event class labels.
20 PAPERS • 1 BENCHMARK
DIOR-RSVG is a large-scale benchmark dataset for visual grounding in remote sensing (RSVG), which aims to localize referred objects in remote sensing (RS) images with the guidance of natural language. The dataset provides image/expression/box triplets for training and evaluating visual grounding models.
14 PAPERS • NO BENCHMARKS YET
IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The dataset contains 7,470 pairs of images and reports.
14 PAPERS • 2 BENCHMARKS
SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.
14 PAPERS • 1 BENCHMARK
Consists of over 39,000 images taken by people who are blind, each paired with five captions.
12 PAPERS • NO BENCHMARKS YET
A collection that allows researchers to approach the extremely challenging problem of description generation with relatively simple non-parametric methods, producing surprisingly effective results.
11 PAPERS • NO BENCHMARKS YET
WHOOPS! is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. It contains images that defy commonsense for a wide range of reasons, such as deviations from expected social norms and everyday knowledge.
11 PAPERS • 4 BENCHMARKS
Object HalBench is a benchmark used to evaluate the performance of language models, particularly multimodal ones that can process and generate both text and images. It is designed to test how well these models avoid "hallucinations": generating text that is not factually grounded in the images they are processing (a simple hallucination-rate sketch follows below).
9 PAPERS • 1 BENCHMARK
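The core quantity behind such benchmarks is an object hallucination rate in the spirit of CHAIR: the fraction of objects mentioned in a generated response that do not appear in the image's ground-truth object set. The sketch below illustrates that idea only; it is not Object HalBench's official scoring script.

def object_hallucination_rate(mentioned_objects, ground_truth_objects):
    """Fraction of mentioned objects that are absent from the ground-truth set."""
    mentioned = set(mentioned_objects)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - set(ground_truth_objects)
    return len(hallucinated) / len(mentioned)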
This dataset is a collection of spoken language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset consists of situations where the robot is instructed by the operator to pick up a specific object and move it to another location, for example: "Move the blue and white tissue box to the top right bin." The dataset consists of RGBD images, bounding box annotations, destination box annotations, and text instructions (a sketch of one sample follows below).
7 PAPERS • NO BENCHMARKS YET
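A rough sketch of what one sample of such a pick-and-place dataset might look like in code; the class and field names below are illustrative assumptions, not the dataset's published schema.

from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class PickAndPlaceSample:
    rgb: np.ndarray                               # H x W x 3 color image
    depth: np.ndarray                             # H x W depth map (the RGBD pair)
    target_box: Tuple[int, int, int, int]         # bounding box of the object to pick
    destination_box: Tuple[int, int, int, int]    # bounding box of the destination bin
    instruction: str                              # e.g. "Move the blue and white tissue box to the top right bin."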
CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.
6 PAPERS • 1 BENCHMARK
UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.
6 PAPERS • NO BENCHMARKS YET
This dataset consists of images and annotations in Bengali. The images are human-annotated in Bengali by two adult native Bengali speakers. All popular image captioning datasets have a predominant Western cultural bias, with annotations done in English. Using such datasets to train an image captioning system assumes that a good English-to-target-language translation system exists and that the original dataset had elements of the target culture. Both these assumptions are false, leading to the need for a culturally relevant dataset in Bengali to generate appropriate captions for images relevant to the Bangladeshi and wider subcontinental context. The dataset presented consists of 9,154 images.
5 PAPERS • 1 BENCHMARK
Contains 8k Flickr images with captions.
5 PAPERS • 2 BENCHMARKS
Peir Gross (Jing et al., 2018) was collected from the Gross sub-collection of the PEIR digital library, resulting in 7,442 image-caption pairs from 21 different sub-categories. Each caption contains only one sentence.
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
5 PAPERS • NO BENCHMARKS YET
Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
Concadia is a publicly available Wikipedia-based corpus, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context.
4 PAPERS • 1 BENCHMARK
A new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
4 PAPERS • NO BENCHMARKS YET
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.
The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it? "), and symbolic references ads make (e.g. a dove symbolizes peace).
3 PAPERS • NO BENCHMARKS YET
LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web images. The images are taken from the English subset of LAION-5B, and the captions were generated with an ensemble of BLIP L/14 and 2 CLIP versions (L/14 and RN50x64). This dataset allows models to produce high-quality captions for images.
3 PAPERS • 1 BENCHMARK
OpenCHAIR is a benchmark for evaluating open-vocabulary hallucinations in image captioning models. By leveraging the linguistic knowledge of LLMs, OpenCHAIR is able to perform fine-grained hallucination measurements and to significantly increase the number of objects that can be measured (especially when compared to the existing benchmark, CHAIR). To exploit the LLM's full potential, we construct a new dataset by generating 5,000 captions with highly diverse objects and let a powerful text-to-image model generate images for them. We find that we not only increase the benchmark's diversity, but also improve evaluation accuracy with respect to CHAIR.
The Polaris dataset offers a large-scale, diverse benchmark for evaluating metrics for image captioning, surpassing existing datasets in terms of size, caption diversity, number of human judgments, and granularity of the evaluations. It includes 131,020 generated captions and 262,040 reference captions. The generated captions have a vocabulary of 3,154 unique words and the reference captions have a vocabulary of 22,275 unique words.
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
The WikiScenes dataset consists of paired images and language descriptions capturing world landmarks and cultural sites, with associated 3D models and camera poses. WikiScenes is derived from the massive public catalog of freely-licensed crowdsourced data in the Wikimedia Commons project, which contains a large variety of images with captions and other metadata.
This is an open-source image captioning dataset for the aesthetic evaluation of images. The dataset is called DPC-Captions, and it contains comments on up to five aesthetic attributes per image, obtained through knowledge transfer from a fully-annotated small-scale dataset.
2 PAPERS • NO BENCHMARKS YET