🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language

80 dataset results for Visual Question Answering (VQA) AND Images

IQUAD (Interactive Question Answering Dataset)

IQUAD is a dataset for Visual Question Answering in interactive environments. It is built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive object. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration.

6 PAPERS • NO BENCHMARKS YET

HallusionBench

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvement in image reasoning tasks. This was shown by the recently released GPT-4V(ison), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and solely rely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than LLMs and may result in misleading visual representations, which are then translated to confident mistakes by LLMs.

5 PAPERS • 1 BENCHMARK

ZS-F-VQA

The ZS-F-VQA dataset is a new split of the F-VQA dataset for zero-shot problem. Firstly we obtain the original train/test split of F-VQA dataset and combine them together to filter out the triples whose answers appear in top-500 according to its occurrence frequency. Next, we randomly divide this set of answers into new training split (a.k.a. seen) $\mathcal{A}_s$ and testing split (a.k.a. unseen) $\mathcal{A}_u$ at the ratio of 1:1. With reference to F-VQA standard dataset, the division process is repeated 5 times. For each $(i,q,a)$ triplet in original F-VQA dataset, it is divided into training set if $a \in \mathcal{A}_s$. Else it is divided into testing set. The overlap of answer instance between training and testing set in F-VQA are $2565$ compared to $0$ in ZS-F-VQA.

5 PAPERS • 1 BENCHMARK

DocCVQA (Document Collection Visual Question Answering)

DocCVQA is a Document Visual Question Answering dataset, where the questions are posed over a whole collection of 14,362 scanned documents. Therefore, the task can be seen as a retrieval-style evidence seeking task where given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering this question as well as provide the answer.

4 PAPERS • NO BENCHMARKS YET

MP-DocVQA (Multipage Document Visual Question Answering)

The dataset is aimed to perform Visual Question Answering on multipage industry scanned documents. The questions and answers are reused from Single Page DocVQA (SP-DocVQA) dataset. The images also corresponds to the same in original dataset with previous and posterior pages with a limit of up to 20 pages per document.

4 PAPERS • NO BENCHMARKS YET

VizWiz-VQA-Grounding

The VizWiz-VQA-Grounding dataset is a dataset that visually grounds answers to visual questions asked by people with visual impairments.

4 PAPERS • NO BENCHMARKS YET

MCVQA (Multilingual and Code-mixed Visual Question Answering)

The MCVQA dataset consists of 248, 349 training questions and 121, 512 validation questions for real images in Hindi and Code-mixed. For each Hindi question, we also provide its 10 corresponding answers in Hindi.

3 PAPERS • NO BENCHMARKS YET

PDFVQA

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

3 PAPERS • NO BENCHMARKS YET

SciGraphQA

SciGraphQA is a large-scale, open-domain dataset focused on generating multi-turn conversational question-answering dialogues centered around understanding and describing scientific graphs and figures. It contains over 300,000 samples derived from academic research papers in computer science and machine learning domains.

3 PAPERS • 1 BENCHMARK

VQA-VS (a new VQA benchmark considering Varying Shortcuts)

The current OOD benchmark VQA-CP v2 only considers one type of shortcut (from question type to answer) and thus still cannot guarantee that the modelrelies on the intended solution rather than a solution specific to this shortcut. To overcome this limitation, VQA-VS proposes a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, VQA-VS overcomes three troubling practices in the use of VQA-CP v2, e.g., selecting models using OOD test sets, and further standardize OOD evaluation procedure. VQA-VS provides a more rigorous and comprehensive testbed for shortcut learning in VQA.

3 PAPERS • NO BENCHMARKS YET

CLEVR-Math

CLEVR-Math is a multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. These word problems requires a combination of language, visual and mathematical reasoning.

2 PAPERS • NO BENCHMARKS YET

Rad-ReStruct

Rad-ReStruct is a fine-grained structured reporting dataset for Chest X-Ray images. The structured reporting process is modeled as a hierarchical VQA task and the task is recognizing different findings in different body regions and predicting their attributes.

2 PAPERS • 1 BENCHMARK

Visual Beliefs

Visual Beliefs is a dataset of abstract scenes to study visual beliefs. The dataset consists of 8-frame scenes, and in each scene a person has a mistaken belief. The dataset can be used for two tasks: predicting who is mistaken and predicting when are they mistaken.

2 PAPERS • NO BENCHMARKS YET

VizWiz-QualityIssues

A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.

2 PAPERS • NO BENCHMARKS YET

WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

WHOOPS! Is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. It contains commonsense-defying image from a wide range of reasons, deviations from expected social norms and everyday knowledge.

2 PAPERS • 4 BENCHMARKS

AesVQA

AesVQA is a dataset that contains 72168 high-quality images and 324756 pairs of aesthetic questions. This dataset addresses the task of aesthetic VQA and introduces subjectiveness into VQA tasks.

1 PAPER • NO BENCHMARKS YET

AlgoPuzzleVQA

We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. It ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT

1 PAPER • 1 BENCHMARK

CLEVR-MRT (CLEVR: Mental Rotation Tests)

CLEVR Mental Rotation Tests (CLEVR-MRT) is a new version of the CLEVR dataset. It contains 20 images generated for each scene holding a constant altitude and sampling over azimuthal angle. It is a controlled setting whereby questions are posed about the properties of a scene if that scene was observed from another viewpoint.

1 PAPER • NO BENCHMARKS YET

ChiQA (Chinese VQA)

ChiQA is a dataset designed for visual question answering tasks that not only measures the relatedness but also measures the answerability, which demands more fine-grained vision and language reasoning. It contains more than 40K questions and more than 200K question-images pairs. The questions are real-world image-independent queries that are more various and unbiased.

1 PAPER • NO BENCHMARKS YET

DME VQA dataset (Diabetic Macular Edema VQA dataset)

Medical VQA dataset built from the IDRiD and eOphta datasets. The dataset contains both healthy and unhealthy fundus images. For each image, a set of pre-defined questions is generated, including questions about regions (e.g. are there hard exudates in this region?), for which an associated mask denotes the location of the region.

1 PAPER • NO BENCHMARKS YET

GQA-OOD

GQA-OOD is a new dataset and benchmark for the evaluation of VQA models in OOD (out of distribution) settings.

1 PAPER • NO BENCHMARKS YET

IllusionVQA

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as Logo IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. The task consists of 1000 instances following a similar format to the first task. We refer to this task as Logo IllusionVQA-Soft-Localization.

1 PAPER • 2 BENCHMARKS

MAVERICS

Manually vAlidated Vq2a Examples fRom Image/Caption datasetS (MAVERICS) is a suite of test-only visual question answering datasets.

1 PAPER • NO BENCHMARKS YET

ParsVQA-Caps

Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widespread vision-and-language datasets directly adopt images representative of American or European cultures resulting in bias. Hence we introduce ParsVQA-Caps, the first benchmark in Persian for Visual Question Answering and Image Captioning tasks. We utilize two ways to collect datasets for each task, human-based and template-based for VQA and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question and answer pairs with short and long answers usable for both classification and generation VQA.

1 PAPER • NO BENCHMARKS YET

QLEVR

Synthetic datasets have successfully been used to probe visual question-answering datasets for their reasoning abilities. CLEVR, for example, tests a range of visual reasoning abilities. The questions in CLEVR focus on comparisons of shapes, colors, and sizes, numerical reasoning, and existence claims. This paper introduces a minimally biased, diagnostic visual question-answering dataset, QLEVR, that goes beyond existential and numerical quantification and focus on more complex quantifiers and their combinations, e.g., asking whether there are more than two red balls that are smaller than at least three blue balls in an image. We describe how the dataset was created and present a first evaluation of state-of-the-art visual question-answering models, showing that QLEVR presents a formidable challenge to our current models.

1 PAPER • 1 BENCHMARK

Super-CLEVR-3D

Super-CLEVR-3D is a visual question answering (VQA) dataset where the questions are about the explicit 3D configuration of the objects from images (i.e. 3D poses, parts, and occlusion). It consists of objects from 5 categories: aeroplanes, buses, bicycles, cars and motorbikes. The rendered objects are from CGParts dataset, with the same setting as Super-CLEVR dataset.

1 PAPER • NO BENCHMARKS YET

T2 Guiding

T2 Guiding is a dataset of 1000 images, each with six image labels. The images are from the Open Images Dataset (OID) and the dataset includes 2 sets of machine-generated labels for these images.

1 PAPER • NO BENCHMARKS YET

TinySocial

TinySocial is a dataset to enable research on Social Visual Question Answering.

1 PAPER • NO BENCHMARKS YET

VDQG (Visual Discriminative Question Generation)

The Visual Discriminative Question Generation (VDQG) dataset contains 11202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average.

1 PAPER • NO BENCHMARKS YET

VQA 360°

VQA 360° is a dataset for visual question answering on 360° images containing around 17,000 real-world image-question-answer triplets for a variety of question types.

1 PAPER • NO BENCHMARKS YET

VQA-MHUG

VQA-MHUG is a 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker.

1 PAPER • NO BENCHMARKS YET

simply-CLEVR

The simply-CLEVR dataset aims to provide a benchmark dataset that can be used for transparent quantitative evaluation of explanation methods (aka heatmaps/XAI methods). It is made of simple Visual Question Answering (VQA) questions, which are derived from the original CLEVR task, and where each question is accompanied by two Ground Truth Masks that serve as a basis for evaluating explanations on the input image.

1 PAPER • NO BENCHMARKS YET

Datasets

80 dataset results for Visual Question Answering (VQA) AND Images