The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, the dataset is more diverse than some other question-answering datasets. SQuAD1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to the answerable ones.
1,208 PAPERS • 10 BENCHMARKS
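As a rough illustration of the span-based answer format described for SQuAD, the sketch below checks that an answer is a contiguous span of its context and treats unanswerable questions as having no gold span. The field names (`answers`, `text`, `answer_start`, `is_impossible`) are modeled on the public SQuAD JSON release but should be treated as assumptions here, and the example records are made up rather than taken from the dataset.

```python
from typing import Optional


def extract_span(context: str, answer_start: int, answer_text: str) -> Optional[str]:
    """Return the answer if it occurs in the context at answer_start, else None."""
    end = answer_start + len(answer_text)
    span = context[answer_start:end]
    return span if span == answer_text else None


def gold_answer(record: dict, context: str) -> Optional[str]:
    """Return the first gold span, or None for an unanswerable question."""
    if record.get("is_impossible", False) or not record["answers"]:
        return None
    first = record["answers"][0]
    return extract_span(context, first["answer_start"], first["text"])


# Hypothetical records for illustration only.
context = "The Eiffel Tower was completed in 1889 in Paris."
answerable = {
    "question": "When was the Eiffel Tower completed?",
    "answers": [{"text": "1889", "answer_start": context.index("1889")}],
    "is_impossible": False,
}
unanswerable = {
    "question": "When was the Eiffel Tower demolished?",
    "answers": [],
    "is_impossible": True,
}

print(gold_answer(answerable, context))
print(gold_answer(unanswerable, context))
```

A reader model for SQuAD2.0-style data would need to either predict such a span or abstain, which is why the sketch returns `None` for the second record.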
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example consists of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question and one or more short spans from the annotated passage containing the actual answer. The long and short answer annotations can, however, be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer that is “yes” or “no”, instead of a list of short spans.
381 PAPERS • 7 BENCHMARKS
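The annotation cases described for Natural Questions can be sketched as a small decision function. This is a minimal illustration only: the argument names, the `(start, end)` span representation and the returned labels are hypothetical and do not reflect the corpus's actual schema. Rejecting a short answer that has no long answer is also an assumption, based on the statement that short spans come from the annotated passage.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a span is a (start, end) pair of token offsets.
Span = Tuple[int, int]


def classify_annotation(long_answer: Optional[Span],
                        short_answers: List[Span],
                        yes_no: Optional[str] = None) -> str:
    """Map one annotation to the case it represents."""
    if long_answer is None:
        # Short answers are drawn from the long answer, so they cannot
        # exist without one (assumption made for this sketch).
        if short_answers or yes_no is not None:
            raise ValueError("short answer given without a long answer")
        return "no_answer"        # nothing on the page answers the question
    if yes_no is not None:
        return "yes_no"           # passage annotated with "yes" or "no"
    if short_answers:
        return "long_and_short"   # passage plus one or more short spans
    return "long_only"            # passage answers, but no explicit short span


print(classify_annotation(None, []))
print(classify_annotation((10, 80), []))
print(classify_annotation((10, 80), [(12, 14)]))
print(classify_annotation((10, 80), [], yes_no="yes"))
```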
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as the Stanford Question Answering Dataset (SQuAD), as the answer to a question may not be directly obtainable by span prediction and the context is very long. The TriviaQA dataset consists of both human-verified and machine-generated QA subsets.
269 PAPERS • 3 BENCHMARKS
SearchQA was built using an in-production, commercial search engine. It closely reflects the full pipeline of a (hypothetical) general question-answering system, which consists of information retrieval and answer synthesis.
93 PAPERS • 1 BENCHMARK
LAnguage Model Analysis (LAMA) consists of a set of knowledge sources, each consisting of a set of facts. LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
83 PAPERS • NO BENCHMARKS YET
DuReader is a large-scale open-domain Chinese machine reading comprehension dataset. The dataset consists of 200K questions, 420K answers and 1M documents. The questions and documents are based on Baidu Search and Baidu Zhidao. The answers are manually generated. The dataset additionally provides question type annotations – each question was manually annotated as either Entity, Description or YesNo and one of Fact or Opinion.
43 PAPERS • 4 BENCHMARKS
KILT (Knowledge Intensive Language Tasks) is a benchmark consisting of 11 datasets representing 5 types of tasks.
42 PAPERS • 11 BENCHMARKS
QUASAR-T is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 43,013 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. The answers to these questions are free-form spans of text, though most are noun phrases.
42 PAPERS • 2 BENCHMARKS
Question Answering by Search And Reading (QUASAR) is a large-scale dataset consisting of QUASAR-S and QUASAR-T. Each of these datasets is built to evaluate systems that must understand a natural language query, search a large corpus of text, and extract an answer to the question from that corpus. Specifically, QUASAR-S comprises 37,012 fill-in-the-gaps questions that are collected from the popular website Stack Overflow using entity tags. The QUASAR-T dataset contains 43,013 open-domain questions collected from various internet sources. The candidate documents for each question in this dataset are retrieved from an Apache Lucene based search engine built on top of the ClueWeb09 dataset.
34 PAPERS • 1 BENCHMARK
WikiMovies is a dataset for question answering about movie content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),
32 PAPERS • NO BENCHMARKS YET
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation. Break contains human-composed questions, sampled from 10 leading question-answering benchmarks over text, images and databases. This dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.
22 PAPERS • NO BENCHMARKS YET
The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. The lessons contain 26,260 questions, 12,567 of which have an accompanying diagram.
21 PAPERS • 1 BENCHMARK
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering.
12 PAPERS • NO BENCHMARKS YET
This dataset is composed of 1,395 questions posed by crowdworkers on Wikipedia articles, together with a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).
11 PAPERS • NO BENCHMARKS YET
QReCC contains 14K conversations with 81K question-answer pairs. QReCC is built on questions from TREC CAsT, QuAC and Google Natural Questions (NQ). While the TREC CAsT and QuAC datasets contain multi-turn conversations, Natural Questions is not a conversational dataset. Questions in the NQ dataset were used as prompts to create conversations, explicitly balancing types of context-dependent questions, such as anaphora (co-references) and ellipsis.
8 PAPERS • NO BENCHMARKS YET
XQA is a dataset consisting of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.
7 PAPERS • NO BENCHMARKS YET