This is a dataset for Bengali Captioning from Images.
1 PAPER • NO BENCHMARKS YET
Consists of eye movements and verbal descriptions recorded synchronously over images.
ESP dataset (Evaluation for Styled Prompt dataset) is a benchmark for zero-shot domain-conditional caption generation. ESP is a new dataset focusing on providing multiple styled text targets for the same image. It comprises 4.8k captions from 1k images in the COCO Captions test set. We collect five text domains with everyday usage: blog, social media, instruction, story, and news.
Image Caption Quality Dataset is a dataset of crowdsourced ratings for machine-generated image captions. It contains more than 600k ratings of image-caption pairs.
AI algorithms, and in particular Machine Learning (ML) algorithms, learn from data tasks that have been traditionally done by humans such as: image classification, facial recognition, linguistic translation etc. To have a good generalization capability, AI algorithms must learn from sufficiently representative data, which is unfortunately not often the case. This results in a hyper-specialization of AI and its inability to perform well on new data whose distribution is too far from the one of the training set. It raises ethical questions which will undoubtedly have direct or indirect consequences on society. However, and despite biases they can entail, AI technologies are revolutionizing virtually every industry, and are forcing players in those industries to reinvent their businesses.
A large-scale machine comprehension dataset (based on the COCO images and captions).
OpenCHAIR is a benchmark for evaluating open-vocabulary hallucinations in image captioning models. By leveraging the linguistic knowledge of LLMs, OpenCHAIR is able to perform fine-grained hallucination measurements, as well as significantly increase the amount of objects that can be measured (especially when compared to the existing benchmark, CHAIR). To exploit the LLM's full potential we construct a new dataset by generating 2000 captions with highly diverse objects and let a powerful text-to-image model generate images for them. We find that we are not just able to increase the benchmark's diversity, but also improve the evaluation accuracy with respect to CHAIR's.
Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widespread vision-and-language datasets directly adopt images representative of American or European cultures resulting in bias. Hence we introduce ParsVQA-Caps, the first benchmark in Persian for Visual Question Answering and Image Captioning tasks. We utilize two ways to collect datasets for each task, human-based and template-based for VQA and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question and answer pairs with short and long answers usable for both classification and generation VQA.
The Polaris dataset offers a large-scale, diverse benchmark for evaluating metrics for image captioning, surpassing existing datasets in terms of size, caption diversity, number of human judgments, and granularity of the evaluations. It includes 131,020 generated captions and 262,040 reference captions. The generated captions have a vocabulary of 3,154 unique words and the reference captions have a vocabulary of 22,275 unique words.
PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.
The Reddit Photo Critique Dataset (RPCD) contains tuples of image and photo critiques. RPCD consists of 74K images and 220K comments and is collected from a Reddit community used by hobbyists and professional photographers to improve their photography skills by leveraging constructive community feedback.
T2 Guiding is a dataset of 1000 images, each with six image labels. The images are from the Open Images Dataset (OID) and the dataset includes 2 sets of machine-generated labels for these images.
The Visual Discriminative Question Generation (VDQG) dataset contains 11202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average.
WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as the large-scale pre-training for image understanding, image captioning, visual question answering, object detection etc.
The WikiScenes dataset consists of paired images and language descriptions capturing world landmarks and cultural sites, with associated 3D models and camera poses. WikiScenes is derived from the massive public catalog of freely-licensed crowdsourced data in the Wikimedia Commons project, which contains a large variety of images with captions and other metadata.
Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.
ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. The dataset aims to evaluate the capability to generate diverse domain-specific language conditioned on the same image. It comprises 4.8k captions from 1k images in the COCO Captions test set. We collected five text domains with everyday usage: blog, social media, instruction, story, and news using Amazon MTurk.
0 PAPER • NO BENCHMARKS YET