The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to each question is a span of tokens from the given passage. Because the questions and answers are produced by humans through crowdsourcing, the dataset is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 answerable questions of SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to resemble the answerable ones. (SQuAD-Open, by contrast, refers to the open-domain variant of the task, in which no supporting passage is given.)
1,267 PAPERS • 7 BENCHMARKS
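As a rough illustration of the answer-span format described above, the sketch below builds one hypothetical SQuAD-style record (the passage, question, and offset are made up, not drawn from the dataset) and checks that each answer actually occurs at its stated character offset:

```python
# Hypothetical SQuAD-style record: each answer is given as text plus a
# character offset into the context (the "answer_start" field).
record = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "What does the Amazon rainforest cover?",
    "answers": {"text": ["much of the Amazon basin"], "answer_start": [29]},
}

def answers_are_valid_spans(rec):
    """Check that every answer text occurs at its stated offset in the context."""
    ctx = rec["context"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(rec["answers"]["text"], rec["answers"]["answer_start"])
    )

assert answers_are_valid_spans(record)
```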
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks: the single-sentence tasks CoLA and SST-2; the similarity and paraphrase tasks MRPC, STS-B, and QQP; and the natural language inference tasks MNLI, QNLI, RTE, and WNLI.
889 PAPERS • 14 BENCHMARKS
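One convenient way to access all nine tasks is the Hugging Face datasets library; the sketch below assumes the library is installed and that the hub identifiers for the GLUE configurations are unchanged:

```python
from datasets import load_dataset

# The nine GLUE task configurations as hosted on the Hugging Face hub.
GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"]

sst2 = load_dataset("glue", "sst2")  # single-sentence sentiment classification
print(sst2["train"][0])              # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
```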
CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; each image comes with a number of highly compositional questions that fall into different categories. Those categories fall into five classes of tasks: Exist, Count, Compare Integer, Query Attribute, and Compare Attribute. The CLEVR dataset consists of a training set of 70k images and 700k questions, a validation set of 15k images and 150k questions, and a test set of 15k images and 150k questions, together with answers, scene graphs, and functional programs for all training and validation images and questions. Aside from its position, each object in a scene is characterized by four attributes: 2 sizes (large, small), 3 shapes (cube, cylinder, sphere), 2 material types (rubber, metal), and 8 colors (gray, blue, brown, yellow, red, green, purple, cyan), resulting in 96 unique combinations.
528 PAPERS • 2 BENCHMARKS
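The 96 unique combinations follow directly from the attribute inventory above (2 sizes x 3 shapes x 2 materials x 8 colors); a quick check:

```python
from itertools import product

sizes = ["large", "small"]
shapes = ["cube", "cylinder", "sphere"]
materials = ["rubber", "metal"]
colors = ["gray", "blue", "brown", "yellow", "red", "green", "purple", "cyan"]

# Every object is a (size, shape, material, color) tuple: 2 * 3 * 2 * 8 = 96.
combinations = list(product(sizes, shapes, materials, colors))
assert len(combinations) == 96
```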
MS MARCO (Microsoft MAchine Reading COmprehension) is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions with human-generated answers. Over time, the collection was extended with a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, and keyphrase extraction, crawling, and conversational search datasets.
383 PAPERS • 2 BENCHMARKS
TriviaQA is a realistic text-based question answering dataset that includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. It is more challenging than standard QA benchmarks such as the Stanford Question Answering Dataset (SQuAD), because the answer to a question may not be obtainable by direct span prediction and the context is very long. TriviaQA includes both human-verified and machine-generated QA subsets.
374 PAPERS • 2 BENCHMARKS
ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use.
325 PAPERS • NO BENCHMARKS YET
The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs.
262 PAPERS • 1 BENCHMARK
The ReAding Comprehension dataset from Examinations (RACE) is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams targeting Chinese students aged 12 to 18. RACE consists of two subsets, RACE-M and RACE-H, drawn from middle school and high school exams, respectively: RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct. RACE differs from most machine reading comprehension datasets in how it was created: instead of generating questions and answers by heuristics or crowd-sourcing, its questions were specifically designed by domain experts to test human reading skills.
258 PAPERS • 3 BENCHMARKS
The bAbI dataset is a textual QA benchmark composed of 20 different tasks. Each task is designed to test a different reasoning skill, such as deduction, induction, and coreference resolution. Some of the tasks need relational reasoning, for instance, to compare the size of different entities. Each sample is composed of a question, an answer, and a set of facts. There are two versions of the dataset, referring to different dataset sizes: bAbI-1k and bAbI-10k. The bAbI-10k version of the dataset consists of 10,000 training samples per task.
253 PAPERS • 1 BENCHMARK
HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to require the introduction paragraphs of two Wikipedia articles to answer. Each question in the dataset comes with the two gold paragraphs, as well as a list of sentences in these paragraphs that crowdworkers identify as supporting facts necessary to answer the question.
252 PAPERS • 1 BENCHMARK
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example comprises a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question, and one or more short spans from the annotated passage containing the actual answer. The long and short answer annotations can, however, be empty. If both are empty, there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer that is "yes" or "no" instead of a list of short spans.
252 PAPERS • 4 BENCHMARKS
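The annotation cases described above can be summarized as a small decision procedure; the sketch below is illustrative and does not use the official field names of the corpus:

```python
def answer_type(long_answer, short_answers):
    """Classify a Natural Questions example by which annotations are present.

    A short answer only ever occurs inside an annotated long answer, so the
    (no long answer, short answer present) case does not arise.
    """
    if not long_answer and not short_answers:
        return "no answer on the page"
    if long_answer and not short_answers:
        return "long answer only; no explicit short answer found"
    if short_answers in (["yes"], ["no"]):
        return "boolean short answer"
    return "short answer span(s) within the annotated passage"

print(answer_type("...annotated passage...", ["yes"]))  # boolean short answer
```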
MCTest is a freely available set of stories and associated questions intended for research on the machine comprehension of text.
248 PAPERS • 2 BENCHMARKS
The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, Bing query logs were used as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, sentences in this section were used as the candidate answers. The corpus includes 3,047 questions and 29,258 sentences, where 1,473 sentences were labeled as answer sentences to their corresponding questions.
237 PAPERS • 1 BENCHMARK
CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.
224 PAPERS • 2 BENCHMARKS
CNN/Daily Mail is a dataset for text summarization and reading comprehension. Human-written abstractive summary bullets from news stories on the CNN and Daily Mail websites serve as questions (with one of the entities hidden), and the stories serve as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract, and generate pairs of passages and questions from these websites.
194 PAPERS • 5 BENCHMARKS
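A minimal sketch of the cloze construction described above: one entity in a summary bullet is replaced by a placeholder, and the system must recover it from the story. The bullet, entity markers, and helper name are illustrative, not taken from the released scripts:

```python
def make_cloze(bullet, entity, placeholder="@placeholder"):
    """Hide one entity in a summary bullet to form a fill-in-the-blank question."""
    return bullet.replace(entity, placeholder)

bullet = "@entity1 wins the award for best picture"
print(make_cloze(bullet, "@entity1"))
# -> "@placeholder wins the award for best picture"
```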
SuperGLUE is a benchmark dataset designed to pose a more rigorous test of language understanding than GLUE. SuperGLUE has the same high-level motivation as GLUE: to provide a simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English. SuperGLUE follows the basic design of GLUE: it consists of a public leaderboard built around eight language understanding tasks, drawing on existing data, accompanied by a single-number performance metric and an analysis toolkit. However, it improves upon GLUE in several ways: more challenging tasks, more diverse task formats (adding coreference resolution and question answering to sentence classification), comprehensive human baselines, improved code support, and refined usage rules.
187 PAPERS • 7 BENCHMARKS
Question Answering in Context (QuAC) is a large-scale dataset consisting of around 14K crowdsourced question answering dialogs with 98K question-answer pairs in total. Each data instance is an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.
158 PAPERS • 1 BENCHMARK
The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers.
147 PAPERS • 1 BENCHMARK
Quora Question Pairs consists of over 400k question pairs based on actual quora.com questions. Each pair carries a binary label indicating whether the two questions are paraphrases of each other. Standard training, development, and test splits for this dataset have been published.
147 PAPERS • 4 BENCHMARKS
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. To date, the resource most used for this task has been the MS-COCO dataset, containing around 120,000 images with 5-way image-caption annotations produced by paid annotators.
146 PAPERS • NO BENCHMARKS YET
GuessWhat?! is a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images.
142 PAPERS • NO BENCHMARKS YET
The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It was created by crawling questions through the Google Suggest API, and then obtaining answers using Amazon Mechanical Turk. The original split uses 3,778 examples for training and 2,032 for testing. All answers are defined as Freebase entities.
141 PAPERS • 1 BENCHMARK
CommonsenseQA is a dataset for the commonsense question answering task, consisting of 12,247 questions with 5 choices each. The dataset was generated by Amazon Mechanical Turk workers: each worker was shown a source concept from ConceptNet together with three related target concepts, and authored one question per target concept such that only that concept is the correct answer; the other two target concepts serve as distractors, and two further distractors were added to reach five choices per question.
129 PAPERS • 2 BENCHMARKS
Discrete Reasoning Over Paragraphs (DROP) is a crowdsourced, adversarially-created benchmark of 96k questions, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of paragraph content than prior datasets did. The passages are extracted from Wikipedia articles. The dataset is split into a training set of about 77,000 questions, a development set of around 9,500 questions, and a hidden test set similar in size to the development set.
123 PAPERS • 1 BENCHMARK
SHAPES is a dataset of synthetic images designed to benchmark understanding of spatial and logical relations among multiple objects. The dataset consists of complex questions about arrangements of colored shapes. The questions are built around compositions of concepts and relations, e.g., "Is there a red shape above a circle?" or "Is a red shape blue?". Questions contain between two and four attributes, object types, or relationships. There are 244 questions and 15,616 images in total; every question is paired with images for which the answer is yes and images for which the answer is no, which eliminates the risk of learning answer biases.
120 PAPERS • 1 BENCHMARK
SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding fact from the Freebase knowledge base. Each fact is a triple (subject, relation, object), and the answer to the question is always the object. The dataset is divided into training, validation, and test sets with 75,910, 10,845, and 21,687 questions, respectively.
99 PAPERS • 1 BENCHMARK
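Since each question is paired with a single (subject, relation, object) fact and the object is always the answer, the setup can be sketched as below (the fact shown is illustrative, not an actual Freebase record):

```python
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    object: str

fact = Fact("barack_obama", "people/person/place_of_birth", "honolulu")

def answer(question, supporting_fact):
    # Once the supporting fact is identified, the answer is always its object.
    return supporting_fact.object

assert answer("Where was Barack Obama born?", fact) == "honolulu"
```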
Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.
94 PAPERS • 2 BENCHMARKS
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
85 PAPERS • 1 BENCHMARK
ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories written by Amazon Mechanical Turk workers, each following logically from an everyday topic. These stories contain a variety of commonsense causal and temporal relations between everyday events. The authors also collected an additional 3,742 Story Cloze Test stories, each containing a four-sentence body and two candidate endings. The endings were collected by asking Mechanical Turk workers to write both a right ending and a wrong ending after the original endings of the given short stories were removed. Both endings were required to make logical sense and to include at least one character from the main story line. The published ROCStories dataset consists of a training set of 98,162 stories (without candidate wrong endings), plus an evaluation set and a test set that share the same structure (one body and two candidate endings) and each contain 1,871 stories.
79 PAPERS • 2 BENCHMARKS
The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. It was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the job of a 'questioner' and the other acted as an 'answerer'. The questioner sees only the text description of an image (i.e., an image caption from the MS COCO dataset); the original image remains hidden from the questioner. Their task is to ask questions about this hidden image to "imagine the scene better". The answerer sees the image and caption and answers the questions asked by the questioner. The two can continue the conversation by asking and answering questions for at most 10 rounds.
79 PAPERS • 4 BENCHMARKS
CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
75 PAPERS • NO BENCHMARKS YET
ATOMIC is an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment").
74 PAPERS • NO BENCHMARKS YET
Question Answering by Search And Reading (QUASAR) is a large-scale dataset comprising QUASAR-S and QUASAR-T. Both datasets are built to evaluate systems that must understand a natural language query, search a large corpus of text, and extract an answer from it. Specifically, QUASAR-S comprises 37,012 fill-in-the-gap questions collected from the popular website Stack Overflow using entity tags. QUASAR-T contains 43,012 open-domain questions collected from various internet sources. The candidate documents for each question in this dataset are retrieved from an Apache Lucene based search engine built on top of the ClueWeb09 dataset.
72 PAPERS • 1 BENCHMARK
The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%.
71 PAPERS • 1 BENCHMARK
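A sketch of the COPA item format and of why random guessing sits at 50%: each item pairs one premise with exactly two alternatives, one of which is labeled correct (the item below is illustrative):

```python
import random

item = {
    "premise": "The man broke his toe.",
    "question": "cause",  # ask for the more plausible cause
    "alternatives": ["He got a hole in his sock.",
                     "He dropped a hammer on his foot."],
    "label": 1,           # index of the correct alternative
}

# Guessing uniformly between the two alternatives is right half the time.
hits = sum(random.randrange(2) == item["label"] for _ in range(10_000))
print(hits / 10_000)      # ~0.5, the chance baseline
```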
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
67 PAPERS • NO BENCHMARKS YET
WikiHop is a multi-hop question-answering dataset. Queries in WikiHop are constructed from entities and relations in WikiData, while the supporting documents come from WikiReading. A bipartite graph connecting entities and documents is first built, and the answer to each query is located by traversal on this graph. Candidates that are type-consistent with the answer and share the answer's relation in the query are included, producing a candidate set; WikiHop is thus a multiple-choice reading comprehension dataset. There are about 43K samples in the training set, 5K in the development set, and 2.5K in the test set; the test set is not publicly released. The task is to predict the correct answer given a query and multiple supporting documents.
65 PAPERS • 1 BENCHMARK
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
63 PAPERS • NO BENCHMARKS YET
MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by combining information from multiple sentences of a paragraph. The dataset was designed with three key challenges in mind:
* The number of correct answer-options for each question is not pre-specified. This removes over-reliance on answer-options and forces models to decide on the correctness of each candidate answer independently of the others; the task is not to simply identify the best answer-option, but to evaluate the correctness of each answer-option individually (a scoring sketch follows this entry).
* The correct answer(s) is not required to be a span in the text.
* The paragraphs have diverse provenance, being extracted from 7 different domains such as news, fiction, and historical text, and hence are expected to be more diverse in their content than single-domain datasets.
The entire corpus consists of around 10K questions.
62 PAPERS • 1 BENCHMARK
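Because the number of correct options is not pre-specified, each candidate option is judged as an independent binary decision. The sketch below illustrates that idea with a simple per-option accuracy; it is illustrative, not the dataset's official metric (which reports F1 scores):

```python
def per_option_accuracy(predictions, gold):
    """Fraction of answer-options whose correctness was judged correctly."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# A question with four candidate options, two of which are actually correct.
gold = [True, False, True, False]
pred = [True, False, False, False]  # model misses one correct option
print(per_option_accuracy(pred, gold))  # 0.75
```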
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
62 PAPERS • NO BENCHMARKS YET
COCO-QA is a dataset for visual question answering based on MS COCO images. It consists of 78,736 training and 38,948 test question-answer pairs, automatically generated from the image captions. Questions fall into four types (object, number, color, and location), and every answer is a single word.
61 PAPERS • NO BENCHMARKS YET
The TGIF-QA dataset contains 165K QA pairs for animated GIFs from the TGIF dataset [Li et al., CVPR 2016]. The question-answer pairs were collected via crowdsourcing with a carefully designed user interface to ensure quality. The dataset can be used to evaluate video-based Visual Question Answering techniques.
61 PAPERS • NO BENCHMARKS YET
DAQUAR (DAtaset for QUestion Answering on Real-world images) is a dataset of human-generated question-answer pairs about real-world images.
60 PAPERS • NO BENCHMARKS YET
e-SNLI extends the Stanford Natural Language Inference (SNLI) dataset with human-annotated natural language explanations of the entailment labels. It is used for various goals, such as obtaining full-sentence justifications of a model's decisions, improving universal sentence representations, and transferring to out-of-domain NLI datasets.
59 PAPERS • NO BENCHMARKS YET
HoME (Household Multimodal Environment) is a multimodal environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more.
57 PAPERS • NO BENCHMARKS YET
BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
54 PAPERS • 1 BENCHMARK
CosmosQA is a large-scale dataset of 35.6K problems requiring commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines of a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events, which requires reasoning beyond the exact text spans in the context.
53 PAPERS • NO BENCHMARKS YET
The Semantic Scholar corpus (S2) is composed of titles from scientific papers published in machine learning conferences and journals from 1985 to 2017, split by year (33 timesteps).
53 PAPERS • NO BENCHMARKS YET
TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing.
53 PAPERS • 2 BENCHMARKS
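TACRED systems are conventionally scored with micro-averaged precision, recall, and F1 over the 41 relation types, with no_relation treated as the negative class. The sketch below is a simplified stand-in for that convention, not the official scorer:

```python
def micro_f1(preds, golds, neg="no_relation"):
    """Micro-averaged P/R/F1, treating `neg` as the negative class."""
    predicted = sum(p != neg for p in preds)
    actual = sum(g != neg for g in golds)
    correct = sum(p == g != neg for p, g in zip(preds, golds))
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(micro_f1(["org:members", "no_relation"],
               ["org:members", "per:schools_attended"]))  # (1.0, 0.5, 0.666...)
```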