🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

26 dataset results for Open-Domain Question Answering

SQuAD (Stanford Question Answering Dataset)

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answers of questions can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.

1,906 PAPERS • 16 BENCHMARKS

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example is comprised of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question and one or more short spans from the annotated passage containing the actual answer. The long and the short answer annotations can however be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty, but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally 1% of the documents have a passage annotated with a short answer that is “yes” or “no”, instead of a list of short spans.

986 PAPERS • 9 BENCHMARKS

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as Stanford Question Answering Dataset (SQuAD), as the answers for a question may not be directly obtained by span prediction and the context is very long. TriviaQA dataset consists of both human-verified and machine-generated QA subsets.

616 PAPERS • 5 BENCHMARKS

LAMA

LAMA (LAnguage Model Analysis)

LAnguage Model Analysis (LAMA) consists of a set of knowledge sources, each comprised of a set of facts. LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.

204 PAPERS • NO BENCHMARKS YET

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It was created by crawling questions through the Google Suggest API, and then obtaining answers using Amazon Mechanical Turk. The original split uses 3,778 examples for training and 2,032 for testing. All answers are defined as Freebase entities.

199 PAPERS • 3 BENCHMARKS

ScienceQA (Science Question Answering)

Science Question Answering (ScienceQA) is a new benchmark that consists of 21,208 multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations. Out of the questions in ScienceQA, 10,332 (48.7%) have an image context, 10,220 (48.2%) have a text context, and 6,532 (30.8%) have both. Most questions are annotated with grounded lectures (83.9%) and detailed explanations (90.5%). The lecture and explanation provide general external knowledge and specific reasons, respectively, for arriving at the correct answer. To the best of our knowledge, ScienceQA is the first large-scale multimodal dataset that annotates lectures and explanations for the answers.

138 PAPERS • 1 BENCHMARK

SearchQA

SearchQA was built using an in-production, commercial search engine. It closely reflects the full pipeline of a (hypothetical) general question-answering system, which consists of information retrieval and answer synthesis.

126 PAPERS • 1 BENCHMARK

ELI5

ELI5 is a dataset for long-form question answering. It contains 270K complex, diverse questions that require explanatory multi-sentence answers. Web search results are used as evidence documents to answer each question.

117 PAPERS • 2 BENCHMARKS

KILT (KILT Benchmark)

KILT (Knowledge Intensive Language Tasks) is a benchmark consisting of 11 datasets representing 5 types of tasks:

100 PAPERS • 12 BENCHMARKS

DuReader

DuReader is a large-scale open-domain Chinese machine reading comprehension dataset. The dataset consists of 200K questions, 420K answers and 1M documents. The questions and documents are based on Baidu Search and Baidu Zhidao. The answers are manually generated. The dataset additionally provides question type annotations – each question was manually annotated as either Entity, Description or YesNo and one of Fact or Opinion.

61 PAPERS • 4 BENCHMARKS

QUASAR-T (QUestion Answering by Search And Reading – Trivia)

QUASAR-T is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 43,013 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. The answers to these questions are free-form spans of text, though most are noun phrases.

53 PAPERS • 2 BENCHMARKS

QUASAR (QUestion Answering by Search And Reading)

The Question Answering by Search And Reading (QUASAR) is a large-scale dataset consisting of QUASAR-S and QUASAR-T. Each of these datasets is built to focus on evaluating systems devised to understand a natural language query, a large corpus of texts and to extract an answer to the question from the corpus. Specifically, QUASAR-S comprises 37,012 fill-in-the-gaps questions that are collected from the popular website Stack Overflow using entity tags. The QUASAR-T dataset contains 43,012 open-domain questions collected from various internet sources. The candidate documents for each question in this dataset are retrieved from an Apache Lucene based search engine built on top of the ClueWeb09 dataset.

45 PAPERS • 1 BENCHMARK

QReCC

QReCC contains 14K conversations with 81K question-answer pairs. QReCC is built on questions from TREC CAsT, QuAC and Google Natural Questions. While TREC CAsT and QuAC datasets contain multi-turn conversations, Natural Questions is not a conversational dataset. Questions in NQ dataset were used as prompts to create conversations explicitly balancing types of context-dependent questions, such as anaphora (co-references) and ellipsis.

43 PAPERS • NO BENCHMARKS YET

WikiMovies

WikiMovies is a dataset for question answering for movies content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),

39 PAPERS • NO BENCHMARKS YET

BREAK

Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation. Break contains human composed questions, sampled from 10 leading question-answering benchmarks over text, images and databases. This dataset was created by a team of NLP researchers at Tel Aviv University and Allen Institute for AI.

36 PAPERS • NO BENCHMARKS YET

MKQA (Multilingual Knowledge Questions and Answers)

Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.

36 PAPERS • NO BENCHMARKS YET

TQA (Textbook Question Answering)

The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. This includes 26,260 questions, including 12,567 that have an accompanying diagram.

32 PAPERS • 1 BENCHMARK

ARCD

Composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).

20 PAPERS • NO BENCHMARKS YET

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.

16 PAPERS • 2 BENCHMARKS

QAMPARI

QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.

11 PAPERS • NO BENCHMARKS YET

OVEN (Open-domain Visual Entity Recognition)

In this project, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels.

10 PAPERS • 1 BENCHMARK

XQA

XQA is a data which consists of a total amount of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.

6 PAPERS • NO BENCHMARKS YET

CREPE

CREPE is QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. It consists of 8400 Reddit questions with (1) whether there is any false presuppositions annotated, and (2) if any, the presupposition and its correction written.

2 PAPERS • NO BENCHMARKS YET

ConcurrentQA Benchmark

ConcurrentQA is a textual multi-hop QA benchmark to require concurrent retrieval over multiple data-distributions (i.e. Wikipedia and email data). The dataset follow the exact same schema and design as HotpotQA. The data set is downloadable here: https://github.com/facebookresearch/concurrentqa. It also contains model and result analysis code. This benchmark can also be used to study privacy when reasoning over data distributed in multiple privacy scopes --- i.e. Wikipedia in the public domain and emails in the private domain.

2 PAPERS • 1 BENCHMARK

CoreSearch

CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mention, and (2) an annotated collection of destructor passages.

1 PAPER • NO BENCHMARKS YET

WikiQAar

WikiQAar (English-Arabic Wikipedia Question-Answering)

A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

1 PAPER • NO BENCHMARKS YET

Datasets

26 dataset results for Open-Domain Question Answering