The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
6,208 PAPERS • 76 BENCHMARKS
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators.
444 PAPERS • 9 BENCHMARKS
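Because each Flickr30k image ships with five reference sentences, captioning metrics are computed against multiple references per image. A minimal sketch of multi-reference BLEU scoring with NLTK (the captions below are illustrative placeholders, not actual Flickr30k annotations):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative placeholder references -- not actual Flickr30k annotations.
references = [r.lower().split() for r in [
    "A dog runs across the grass.",
    "A brown dog is running in a field.",
    "The dog sprints through the park.",
    "A dog running outside.",
    "A dog races across a grassy lawn.",
]]
candidate = "a dog is running on the grass".split()

# BLEU is computed against all five references at once.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```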
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
179 PAPERS • 1 BENCHMARK
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
102 PAPERS • 6 BENCHMARKS
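A minimal sketch of reading the COCO Captions annotations with the pycocotools COCO API; the annotation path is an assumption about where the files were downloaded:

```python
from pycocotools.coco import COCO

# Path is an assumption; point it at the downloaded annotation file.
coco_caps = COCO("annotations/captions_train2017.json")

img_id = coco_caps.getImgIds()[0]                 # any image id
ann_ids = coco_caps.getAnnIds(imgIds=[img_id])
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])                         # five human-written captions per image
```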
The Hateful Memes dataset is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. Images were licensed from Getty Images so that researchers can use the dataset to support their work.
71 PAPERS • 1 BENCHMARK
The ReferIt dataset contains 130,525 expressions for referring to 96,654 objects in 19,894 images of natural scenes.
54 PAPERS • NO BENCHMARKS YET
The VizWiz-VQA dataset originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. The proposed challenge addresses the following two tasks for this dataset: (1) predict the answer to a visual question and (2) predict whether a visual question cannot be answered.
47 PAPERS • 6 BENCHMARKS
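Answer prediction on VizWiz-VQA is typically scored with a VQA-style consensus metric over the ten crowdsourced answers. The sketch below shows the simplified form (an answer counts as fully correct once at least three annotators gave it), leaving out the official averaging over annotator subsets:

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-style accuracy against the 10 crowdsourced answers."""
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[prediction.strip().lower()]
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators agree with the prediction -> accuracy 1.0
print(vqa_accuracy("blue", ["blue", "blue", "navy", "blue", "blue",
                            "dark blue", "navy blue", "blue jeans",
                            "unanswerable", "navy"]))
```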
The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets.
40 PAPERS • 13 BENCHMARKS
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
27 PAPERS • 5 BENCHMARKS
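A hedged sketch of iterating over a Localized Narratives annotation file; the file name and field names (caption, timed_caption, traces) follow the released JSON Lines format but should be treated as assumptions:

```python
import json

# File name and field names are assumptions based on the released JSON Lines files.
with open("coco_train_localized_narratives.jsonl") as f:
    for line in f:
        ann = json.loads(line)
        print(ann["image_id"], ann["caption"])
        # Each timed-caption segment is an utterance with start/end times that can be
        # aligned with the synchronized mouse-trace points stored in ann["traces"].
        for segment in ann["timed_caption"]:
            print(segment["utterance"], segment["start_time"], segment["end_time"])
        break  # just the first narrative for illustration
```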
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
25 PAPERS • NO BENCHMARKS YET
UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,561 images from the Visual Genome dataset. Each image is paired with one paragraph. The training/val/test sets contain 14,575/2,487/2,489 images.
18 PAPERS • 1 BENCHMARK
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
The SentiCap dataset contains several thousand images with captions expressing positive and negative sentiments. These sentimental captions were constructed by the authors by re-writing factual descriptions. In total there are 2,000+ sentimental captions.
17 PAPERS • NO BENCHMARKS YET
A new large-scale dataset for referring expressions, based on MS-COCO.
16 PAPERS • 4 BENCHMARKS
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and around 50,574 images annotated with event class labels.
14 PAPERS • NO BENCHMARKS YET
FlickrStyle10K is built on the Flickr30K image captioning dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and stylized captions, including humorous and romantic styles. However, only 7,000 pairs from the official training set are now publicly accessible. The dataset can be downloaded via https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip
13 PAPERS • 1 BENCHMARK
The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for the remote sensing image captioning task. It contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC and Tianditu. The images are fixed to 224×224 pixels with various spatial resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image.
10 PAPERS • NO BENCHMARKS YET
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
9 PAPERS • NO BENCHMARKS YET
CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.
6 PAPERS • 1 BENCHMARK
Consists of over 39,000 images taken by people who are blind, each paired with five captions.
6 PAPERS • NO BENCHMARKS YET
This dataset is a collection of spoken language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset consists of situations where the robot is instructed by the operator to pick up a specific object and move it to another location: for example, Move the blue and white tissue box to the top right bin. This dataset consists of RGBD images, bounding box annotations, destination box annotations, and text instructions.
5 PAPERS • NO BENCHMARKS YET
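A hedged sketch of how one record of this pick-and-place instruction data might be represented in code; the field layout below is an assumption for illustration, not the dataset's published schema:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PickPlaceSample:
    """One instruction episode (hypothetical layout)."""
    rgb_path: str                                # RGB image file
    depth_path: str                              # aligned depth image file
    object_box: Tuple[int, int, int, int]        # (x, y, w, h) of the object to pick
    destination_box: Tuple[int, int, int, int]   # (x, y, w, h) of the target bin
    instruction: str                             # natural-language command

sample = PickPlaceSample("scene_001_rgb.png", "scene_001_depth.png",
                         (120, 80, 60, 40), (400, 60, 150, 150),
                         "Move the blue and white tissue box to the top right bin.")
print(sample.instruction)
```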
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
This dataset consists of images and annotations in Bengali. The images are human-annotated in Bengali by two adult native Bengali speakers. All popular image captioning datasets have a predominant Western cultural bias, with the annotations done in English. Using such datasets to train an image captioning system assumes that a good English-to-target-language translation system exists and that the original dataset had elements of the target culture. Both these assumptions are false, leading to the need for a culturally relevant dataset in Bengali to generate appropriate captions for images relevant to the Bangladeshi and wider subcontinental context. The dataset presented consists of 9,154 images.
4 PAPERS • 1 BENCHMARK
A collection that allows researchers to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results.
4 PAPERS • NO BENCHMARKS YET
STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.
3 PAPERS • NO BENCHMARKS YET
This is an open-source image captioning dataset for the aesthetic evaluation of images. The dataset, called DPC-Captions, contains comments on up to five aesthetic attributes per image, obtained through knowledge transfer from a fully annotated small-scale dataset.
2 PAPERS • NO BENCHMARKS YET
DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset and the German part consists of translations by a native German speaker.
Egoshots is a two-month ego-vision dataset captured with an Autographer wearable camera and annotated "for free" via transfer learning, using three state-of-the-art pre-trained image captioning models. The dataset represents the life of two interns working at Philips Research (Netherlands) from May to July 2015, who generously donated their data.
SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.
2 PAPERS • 1 BENCHMARK
This is a dataset for Bengali Captioning from Images.
1 PAPER • NO BENCHMARKS YET
Consists of eye movements and verbal descriptions recorded synchronously over images.
Concadia is a publicly available Wikipedia-based corpus, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context.
Hephaestus is the first large-scale InSAR dataset. Driven by volcanic unrest detection, it provides 19,919 unique satellite frames annotated with a diverse set of labels. Moreover, each sample is accompanied by a textual description of its contents. The goal of this dataset is to boost research on the exploitation of interferometric data, enabling the application of state-of-the-art computer vision and NLP methods. Furthermore, the annotated dataset is bundled with a large archive of unlabeled frames to enable large-scale self-supervised learning. The final size of the dataset amounts to 110,573 interferograms.
Image Caption Quality Dataset is a dataset of crowdsourced ratings for machine-generated image captions. It contains more than 600k ratings of image-caption pairs.
A new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it? "), and symbolic references ads make (e.g. a dove symbolizes peace).
AI algorithms, and in particular machine learning (ML) algorithms, learn from data to perform tasks that have traditionally been done by humans, such as image classification, facial recognition, and linguistic translation. To have good generalization capability, AI algorithms must learn from sufficiently representative data, which is unfortunately often not the case. This results in a hyper-specialization of AI and its inability to perform well on new data whose distribution is too far from that of the training set. It raises ethical questions which will undoubtedly have direct or indirect consequences on society. However, despite the biases they can entail, AI technologies are revolutionizing virtually every industry and are forcing players in those industries to reinvent their businesses.
A large-scale machine comprehension dataset (based on the COCO images and captions).
The Reddit Photo Critique Dataset (RPCD) contains tuples of image and photo critiques. RPCD consists of 74K images and 220K comments and is collected from a Reddit community used by hobbyists and professional photographers to improve their photography skills by leveraging constructive community feedback.
T2 Guiding is a dataset of 1000 images, each with six image labels. The images are from the Open Images Dataset (OID) and the dataset includes 2 sets of machine-generated labels for these images.
UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.
The Visual Discriminative Question Generation (VDQG) dataset contains 11,202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average.
VizWiz-Priv includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content.
A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.
The WikiScenes dataset consists of paired images and language descriptions capturing world landmarks and cultural sites, with associated 3D models and camera poses. WikiScenes is derived from the massive public catalog of freely-licensed crowdsourced data in the Wikimedia Commons project, which contains a large variety of images with captions and other metadata.
Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly -- but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance.
1 PAPER • 1 BENCHMARK
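Winoground performance is commonly reported with three per-example scores (text, image, and group); a minimal sketch, assuming any image-text similarity function s(caption, image):

```python
def winoground_scores(s, c0, c1, i0, i1):
    """Per-example Winoground metrics for the matching pairs (c0, i0) and (c1, i1).

    s(caption, image) is any model's image-text similarity score.
    """
    text_ok = s(c0, i0) > s(c1, i0) and s(c1, i1) > s(c0, i1)
    image_ok = s(c0, i0) > s(c0, i1) and s(c1, i1) > s(c1, i0)
    return {
        "text_score": int(text_ok),    # right caption chosen for each image
        "image_score": int(image_ok),  # right image chosen for each caption
        "group_score": int(text_ok and image_ok),
    }
```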