CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
16 PAPERS • 2 BENCHMARKS
SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video QA based on 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios. Specifically, the dataset proposes 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events.
16 PAPERS • 1 BENCHMARK
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:
13 PAPERS • 8 BENCHMARKS
LIVE-YT-HFR comprises of 480 videos having 6 different frame rates, obtained from 16 diverse contents.
13 PAPERS • 1 BENCHMARK
The VQA-CP dataset was constructed by reorganizing VQA v2 such that the correlation between the question type and correct answer differs in the training and test splits. For example, the most common answer to questions starting with What sport… is tennis in the training set, but skiing in the test set. A model that guesses an answer primarily from the question will perform poorly.
Visual Madlibs is a dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context.
13 PAPERS • NO BENCHMARKS YET
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
10 PAPERS • NO BENCHMARKS YET
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM performs better than another model in common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of
9 PAPERS • 1 BENCHMARK
PointQA is a set of datasets for Visual Question Datasets (VQA) that require a pointer to an object in the image to be answered correctly. The different datasets are: PointQA-Local, PointQA-LookTwice and PointQA-General.
9 PAPERS • NO BENCHMARKS YET
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.
8 PAPERS • 1 BENCHMARK
A configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures.
7 PAPERS • NO BENCHMARKS YET
DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.
This dataset provides a new split of VQA v2 (similarly to VQA-CP v2), which is built of questions that are hard to answer for biased models.
7 PAPERS • 1 BENCHMARK
ViQuAE is a dataset for KVQAE (Knowledge-based Visual Question Answering about named Entities), a task which consists in answering questions about named entities grounded in a visual context using a Knowledge Base. It is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). We argue that KVQAE is a clear, well-defined task that can be evaluated easily, making it suitable to track the progress of multimodal entity representation’s quality. Multimodal entity representation is a central issue that will allow to make human-machine interactions more natural. For example, while watching a movie, one might wonder ‘‘Where did I already see this actress?’’ or ‘‘Did she ever win an Oscar?’’
The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers' correctness. The dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions.
6 PAPERS • NO BENCHMARKS YET
IQUAD is a dataset for Visual Question Answering in interactive environments. It is built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive object. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration.
AI2D-RST is a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowd-sourced descriptions, which was originally developed to support research on automatic diagram understanding and visual question answering.
5 PAPERS • NO BENCHMARKS YET
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvement in image reasoning tasks. This was shown by the recently released GPT-4V(ison), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and solely rely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than LLMs and may result in misleading visual representations, which are then translated to confident mistakes by LLMs.
5 PAPERS • 1 BENCHMARK
A dataset of 69,270,581 video clip, question and answer triplets (v, q, a). HowToVQA69M is two orders of magnitude larger than any of the currently available VideoQA datasets.
LEAF-QA, a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA), and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. LEAF-QA being constructed from real-world sources, requires a novel architecture to enable question answering.
TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, comprised of videos, questions, and answer spans manually collected from screencast tutorial videos with spoken narratives for a photo-editing software.
The ZS-F-VQA dataset is a new split of the F-VQA dataset for zero-shot problem. Firstly we obtain the original train/test split of F-VQA dataset and combine them together to filter out the triples whose answers appear in top-500 according to its occurrence frequency. Next, we randomly divide this set of answers into new training split (a.k.a. seen) $\mathcal{A}_s$ and testing split (a.k.a. unseen) $\mathcal{A}_u$ at the ratio of 1:1. With reference to F-VQA standard dataset, the division process is repeated 5 times. For each $(i,q,a)$ triplet in original F-VQA dataset, it is divided into training set if $a \in \mathcal{A}_s$. Else it is divided into testing set. The overlap of answer instance between training and testing set in F-VQA are $2565$ compared to $0$ in ZS-F-VQA.
DocCVQA is a Document Visual Question Answering dataset, where the questions are posed over a whole collection of 14,362 scanned documents. Therefore, the task can be seen as a retrieval-style evidence seeking task where given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering this question as well as provide the answer.
4 PAPERS • NO BENCHMARKS YET
The dataset is aimed to perform Visual Question Answering on multipage industry scanned documents. The questions and answers are reused from Single Page DocVQA (SP-DocVQA) dataset. The images also corresponds to the same in original dataset with previous and posterior pages with a limit of up to 20 pages per document.
The VizWiz-VQA-Grounding dataset is a dataset that visually grounds answers to visual questions asked by people with visual impairments.
The MCVQA dataset consists of 248, 349 training questions and 121, 512 validation questions for real images in Hindi and Code-mixed. For each Hindi question, we also provide its 10 corresponding answers in Hindi.
3 PAPERS • NO BENCHMARKS YET
Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News.) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to heterogeneous data domain, thus summarizing the data with different contributors.
PDFVQA: A New Dataset for Real-World VQA on PDF Documents
SciGraphQA is a large-scale, open-domain dataset focused on generating multi-turn conversational question-answering dialogues centered around understanding and describing scientific graphs and figures. It contains over 300,000 samples derived from academic research papers in computer science and machine learning domains.
3 PAPERS • 1 BENCHMARK
The current OOD benchmark VQA-CP v2 only considers one type of shortcut (from question type to answer) and thus still cannot guarantee that the modelrelies on the intended solution rather than a solution specific to this shortcut. To overcome this limitation, VQA-VS proposes a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, VQA-VS overcomes three troubling practices in the use of VQA-CP v2, e.g., selecting models using OOD test sets, and further standardize OOD evaluation procedure. VQA-VS provides a more rigorous and comprehensive testbed for shortcut learning in VQA.
The VideoNavQA dataset contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the Embodied Question Answering (EQA) task.
CLEVR-Math is a multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. These word problems requires a combination of language, visual and mathematical reasoning.
2 PAPERS • NO BENCHMARKS YET
FunQA is a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Extensive experiments with existing VideoQA models reveal significant performance gaps for the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
InstaOrder can be used to understand the geometrical relationships of instances in an image. The dataset consists of 2.9M annotations of geometric orderings for class-labeled instances in 101K natural scenes. The scenes were annotated by 3,659 crowd-workers regarding (1) occlusion order that identifies occluder/occludee and (2) depth order that describes ordinal relations that consider relative distance from the camera.
Rad-ReStruct is a fine-grained structured reporting dataset for Chest X-Ray images. The structured reporting process is modeled as a hierarchical VQA task and the task is recognizing different findings in different body regions and predicting their attributes.
2 PAPERS • 1 BENCHMARK
Super-CLEVR is a dataset for Visual Question Answering (VQA) where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. It contains 21 vehicle models belonging to 5 categories, with controllable attributes. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality.
Visual Beliefs is a dataset of abstract scenes to study visual beliefs. The dataset consists of 8-frame scenes, and in each scene a person has a mistaken belief. The dataset can be used for two tasks: predicting who is mistaken and predicting when are they mistaken.
VizWiz-Priv includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content.
A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.
WHOOPS! Is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. It contains commonsense-defying image from a wide range of reasons, deviations from expected social norms and everyday knowledge.
2 PAPERS • 4 BENCHMARKS
AesVQA is a dataset that contains 72168 high-quality images and 324756 pairs of aesthetic questions. This dataset addresses the task of aesthetic VQA and introduces subjectiveness into VQA tasks.
1 PAPER • NO BENCHMARKS YET
We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. It ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT
1 PAPER • 1 BENCHMARK
CLEVR Mental Rotation Tests (CLEVR-MRT) is a new version of the CLEVR dataset. It contains 20 images generated for each scene holding a constant altitude and sampling over azimuthal angle. It is a controlled setting whereby questions are posed about the properties of a scene if that scene was observed from another viewpoint.
CORE-MM is an Open-ended VQA benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. CORE-MM benchmark consists of 279 manually curated reasoning questions, associated with a total of 342 images. The questions are divided into 3 reasoning categories--Deductive, Abductive and Analogical. 49 questions pertain to abductive reasoning, 181 require deductive reasoning, and 49 involve analogicalreasoning. Furthermore, the dataset is divided into two folds based on reasoning complexity, with 108 classified as “High” reasoning complexity and 171 as “Moderate” reasoning complexity.
ChiQA is a dataset designed for visual question answering tasks that not only measures the relatedness but also measures the answerability, which demands more fine-grained vision and language reasoning. It contains more than 40K questions and more than 200K question-images pairs. The questions are real-world image-independent queries that are more various and unbiased.
Medical VQA dataset built from the IDRiD and eOphta datasets. The dataset contains both healthy and unhealthy fundus images. For each image, a set of pre-defined questions is generated, including questions about regions (e.g. are there hard exudates in this region?), for which an associated mask denotes the location of the region.
GQA-OOD is a new dataset and benchmark for the evaluation of VQA models in OOD (out of distribution) settings.
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as Logo IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. The task consists of 1000 instances following a similar format to the first task. We refer to this task as Logo IllusionVQA-Soft-Localization.
1 PAPER • 2 BENCHMARKS