65 dataset results for Image Captioning

BNATURE

This is a dataset for Bengali image captioning.

1 PAPER • NO BENCHMARKS YET

CapGaze

Consists of eye movements and verbal descriptions recorded synchronously over images.

1 PAPER • NO BENCHMARKS YET

ESP (Evaluation for Styled Prompt)

ESP dataset (Evaluation for Styled Prompt dataset) is a benchmark for zero-shot domain-conditional caption generation. ESP is a new dataset focusing on providing multiple styled text targets for the same image. It comprises 4.8k captions from 1k images in the COCO Captions test set. We collect five text domains with everyday usage: blog, social media, instruction, story, and news.

1 PAPER • NO BENCHMARKS YET

Image Caption Quality Dataset

Image Caption Quality Dataset is a dataset of crowdsourced ratings for machine-generated image captions. It contains more than 600k ratings of image-caption pairs.

1 PAPER • NO BENCHMARKS YET

InFashAI

InFashAI (Inclusive Fashion AI)

AI algorithms, and in particular Machine Learning (ML) algorithms, learn from data to perform tasks traditionally done by humans, such as image classification, facial recognition, and linguistic translation. To generalize well, AI algorithms must learn from sufficiently representative data, which is unfortunately often not the case. This results in a hyper-specialization of AI and an inability to perform well on new data whose distribution is too far from that of the training set. This raises ethical questions that will undoubtedly have direct or indirect consequences for society. However, despite the biases they can entail, AI technologies are revolutionizing virtually every industry and are forcing players in those industries to reinvent their businesses.

1 PAPER • NO BENCHMARKS YET

MCIC-COCO

A large-scale machine comprehension dataset (based on the COCO images and captions).

1 PAPER • NO BENCHMARKS YET

OpenCHAIR

OpenCHAIR is a benchmark for evaluating open-vocabulary hallucinations in image captioning models. By leveraging the linguistic knowledge of LLMs, OpenCHAIR is able to perform fine-grained hallucination measurements and to significantly increase the number of objects that can be measured (especially when compared to the existing benchmark, CHAIR). To exploit the LLM's full potential, we construct a new dataset by generating 2000 captions with highly diverse objects and letting a powerful text-to-image model generate images for them. We find that this not only increases the benchmark's diversity but also improves evaluation accuracy relative to CHAIR.

1 PAPER • NO BENCHMARKS YET

ParsVQA-Caps

Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widespread vision-and-language datasets directly adopt images representative of American or European cultures, resulting in bias. Hence, we introduce ParsVQA-Caps, the first benchmark in Persian for Visual Question Answering and Image Captioning tasks. We collect data for each task in two ways: human-based and template-based for VQA, and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question-answer pairs with short and long answers, usable for both classification and generation VQA.

1 PAPER • NO BENCHMARKS YET

Polaris (Polaris dataset)

The Polaris dataset offers a large-scale, diverse benchmark for evaluating metrics for image captioning, surpassing existing datasets in terms of size, caption diversity, number of human judgments, and granularity of the evaluations. It includes 131,020 generated captions and 262,040 reference captions. The generated captions have a vocabulary of 3,154 unique words and the reference captions have a vocabulary of 22,275 unique words.

1 PAPER • NO BENCHMARKS YET

PoseScript

PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.

1 PAPER • NO BENCHMARKS YET

RPCD (Reddit Photo Critique Dataset)

The Reddit Photo Critique Dataset (RPCD) contains tuples of images and photo critiques. RPCD consists of 74K images and 220K comments and was collected from a Reddit community used by hobbyists and professional photographers to improve their photography skills by leveraging constructive community feedback.

1 PAPER • NO BENCHMARKS YET

T2 Guiding

T2 Guiding is a dataset of 1000 images, each with six image labels. The images are from the Open Images Dataset (OID) and the dataset includes 2 sets of machine-generated labels for these images.

1 PAPER • NO BENCHMARKS YET

VDQG (Visual Discriminative Question Generation)

The Visual Discriminative Question Generation (VDQG) dataset contains 11202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average.

1 PAPER • NO BENCHMARKS YET

WebLI

WebLI (Web Language Image)

WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as large-scale pre-training for image understanding, image captioning, visual question answering, and object detection.

1 PAPER • NO BENCHMARKS YET

WikiScenes

The WikiScenes dataset consists of paired images and language descriptions capturing world landmarks and cultural sites, with associated 3D models and camera poses. WikiScenes is derived from the massive public catalog of freely-licensed crowdsourced data in the Wikimedia Commons project, which contains a large variety of images with captions and other metadata.

1 PAPER • NO BENCHMARKS YET

WikiWeb2M (Wikipedia Webpage 2M)

Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It was created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, as well as images and their captions.

1 PAPER • NO BENCHMARKS YET

ESP Dataset (Evaluation for Styled Prompt dataset)

ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. The dataset aims to evaluate the capability to generate diverse domain-specific language conditioned on the same image. It comprises 4.8k captions from 1k images in the COCO Captions test set. We collected five text domains with everyday usage: blog, social media, instruction, story, and news using Amazon MTurk.

0 PAPER • NO BENCHMARKS YET
