The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.
30 PAPERS • 1 BENCHMARK
PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.
PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed size vocabulary or from a bounding box within the image. However, in practice this is an unrealistic assumption because many questions require reasoning and thus have real valued answers which appear neither in a small fixed size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real world plots by introducing PlotQA. Further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed
28 PAPERS • 5 BENCHMARKS
Natural Language Decathlon Benchmark (decaNLP) is a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. The tasks as cast as question answering over a context.
27 PAPERS • NO BENCHMARKS YET
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2801 questions in total, and uses 5-fold cross validation for evaluation.
26 PAPERS • 2 BENCHMARKS
MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within our benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
26 PAPERS • NO BENCHMARKS YET
QuaRTz is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 conversations containing 14115 questions where 2715 of the conversations are simple conversations, and the rest 1,177 are hybrid conversations.
25 PAPERS • 2 BENCHMARKS
The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark," and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems requiring general world knowledge. The GenericsKB is the first large-scale resource containing naturally occurring generic sentences (as opposed to extracted or crowdsourced triples), and is rich in high-quality, general, semantically complete statements. Generics were primarily extracted from three large text sources, namely the Waterloo Corpus, selected parts of Simple Wikipedia, and the ARC Corpus. A filtered, high-quality subset is also available in GenericsKB-Best, containing 1,020,868 sentences.
25 PAPERS • NO BENCHMARKS YET
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
25 PAPERS • 1 BENCHMARK
We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop.
24 PAPERS • 2 BENCHMARKS
CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprised of over 53,000+ multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements.
23 PAPERS • 2 BENCHMARKS
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
23 PAPERS • 1 BENCHMARK
BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.
22 PAPERS • NO BENCHMARKS YET
FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly story narratives, covering seven types of narrative elements or relations. It can support narrative Question Generation (QG) and Narrative Question Answering (QA) tasks.
20 PAPERS • 2 BENCHMARKS
MultiDoc2Dial is a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. We aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents.
20 PAPERS • NO BENCHMARKS YET
PrOntoQA is a question-answering dataset which generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models such as GPT-3.
The WIQA dataset V1 has 39705 questions containing a perturbation and a possible effect in the context of a paragraph. The dataset is split into 29808 train questions, 6894 dev questions and 3003 test questions.
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
19 PAPERS • NO BENCHMARKS YET
A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.
19 PAPERS • 1 BENCHMARK
HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.
QuaRel is a crowdsourced dataset of 2771 multiple-choice story questions, including their logical forms.
The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with a type of deep reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences
18 PAPERS • NO BENCHMARKS YET
CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.
18 PAPERS • 1 BENCHMARK
A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry problems - dense annotations in formal language for the diagrams and text - 27,213 annotated diagram logic forms (literals) - 6,293 annotated text logic forms (literals)
This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models.
A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.
SituatedQA is an open-retrieval QA dataset where systems must produce the correct answer to a question given the temporal or geographical context. Answers to the same question may change depending on the extralinguistic contexts (when and where the question was asked).
With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, the first large-scale dataset for QA over social media data is presented. To make sure the tweets are meaningful and contain interesting information, tweets used by journalists to write news articles are gathered. Then human annotators are asked to write questions and answers upon these tweets. Unlike other QA datasets like SQuAD in which the answers are extractive, the answer are allowed to be abstractive. The task requires model to read a short tweet and a question and outputs a text phrase (does not need to be in the tweet) as the answer.
TECHQA is a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size – 600 training, 310 dev, and 490 evaluation question/answer pairs – thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TECHQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote—a technical document that addresses a specific technical issue.
17 PAPERS • NO BENCHMARKS YET
TopiOCQA (pronounced Tapioca) is an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in the dataset spans 13 question-answer turns and involves four topics (documents). TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history.
Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. The Violin dataset is a dataset for this task which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.
CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
16 PAPERS • 2 BENCHMARKS
KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.
16 PAPERS • 1 BENCHMARK
SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video QA based on 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios. Specifically, the dataset proposes 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events.
Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
ConvQuestions is the first realistic benchmark for conversational question answering over knowledge graphs. It contains 11,200 conversations which can be evaluated over Wikidata. They are compiled from the inputs of 70 Master crowdworkers on Amazon Mechanical Turk, with conversations from five domains: Books, Movies, Soccer, Music, and TV Series. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods. The data gathering setup was kept as natural as possible, with the annotators selecting entities of their choice from each of the five domains, and formulating the entire conversation in one session. All questions in a conversation are from the same Turker, who also provided gold answers to the questions. For suitability to knowledge graphs, questions were constrained to be objective or factoid in nature, but no other r
15 PAPERS • NO BENCHMARKS YET
The beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
With the same format as WikiHop, the MedHop dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins.
TV show Caption is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique as its captions may also describe dialogues/subtitles while the captions in the other datasets are only describing the visual content.
15 PAPERS • 1 BENCHMARK
AmazonQA consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, additional annotations are collected, marking each question as either answerable or unanswerable based on the available reviews.
14 PAPERS • NO BENCHMARKS YET
FreebaseQA is a data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples that were further verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test-bed for emerging KB-QA systems.
license: apache-2.0 tags: human-feedback size_categories: 100K<n<1M pretty_name: OpenAssistant Conversations
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it situates from an first person's perspective and answer questions. The questions are designed to be situated, embodied and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video and BEV picture.
14 PAPERS • 2 BENCHMARKS
ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset contains 34,011 manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing.
13 PAPERS • NO BENCHMARKS YET
Visual Madlibs is a dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context.
Who-did-What collects its corpus from news and provides options for questions similar to CBT. Each question is formed from two independent articles: an article is treated as context to be read and a separate article on the same event is used to form the query.