🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

363 dataset results for Question Answering

Habitat Platform

A platform for research in embodied artificial intelligence (AI).

14 PAPERS • NO BENCHMARKS YET

HeadQA

HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.

20 PAPERS • NO BENCHMARKS YET

House3D Environment

A rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et.al.)

11 PAPERS • NO BENCHMARKS YET

HybridQA

A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and multiple free-form corpora linked with the entities in the table. The questions are designed to aggregate both tabular information and text information, i.e., lack of either form would render the question unanswerable.

55 PAPERS • 1 BENCHMARK

InsuranceQA

InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.

38 PAPERS • NO BENCHMARKS YET

KQA Pro

A large-scale dataset for Complex KBQA.

17 PAPERS • 1 BENCHMARK

KnowIT VQA

KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.

8 PAPERS • 1 BENCHMARK

LAMA

LAMA (LAnguage Model Analysis)

LAnguage Model Analysis (LAMA) consists of a set of knowledge sources, each comprised of a set of facts. LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.

205 PAPERS • NO BENCHMARKS YET

LAReQA

A challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool.

10 PAPERS • NO BENCHMARKS YET

LiveQA

A new question answering dataset constructed from play-by-play live broadcast. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu (https://nba.hupu.com/games) website.

4 PAPERS • NO BENCHMARKS YET

MEDIQA-AnS

MEDIQA-AnS (MEDIQA-Answer Summarization)

The first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches.

13 PAPERS • NO BENCHMARKS YET

MKQA (Multilingual Knowledge Questions and Answers)

Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.

37 PAPERS • NO BENCHMARKS YET

MMED

Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News.) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to heterogeneous data domain, thus summarizing the data with different contributors.

3 PAPERS • NO BENCHMARKS YET

MOCHA

Contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation.

4 PAPERS • NO BENCHMARKS YET

MRQA

The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems.

113 PAPERS • 1 BENCHMARK

ManyModalQA

Collects the data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.

5 PAPERS • NO BENCHMARKS YET

MathQA

MathQA significantly enhances the AQuA dataset with fully-specified operational programs.

97 PAPERS • 1 BENCHMARK

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.

28 PAPERS • 1 BENCHMARK

MilkQA

A question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service.

2 PAPERS • NO BENCHMARKS YET

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances.

27 PAPERS • 2 BENCHMARKS

MovieFIB

MovieFIB (Movie Fill-in-the-Blank)

A quantitative benchmark for developing and understanding video of fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.

6 PAPERS • NO BENCHMARKS YET

MovieGraphs

Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions.

12 PAPERS • NO BENCHMARKS YET

MultiReQA

MultiReQA is a cross-domain evaluation for retrieval question answering models. Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus. MultiReQA is a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets from the MRQA shared task. MultiReQA contains the sentence boundary annotation from eight publicly available QA datasets including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, and TextbookQA. Five of these datasets, including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, contain both training and test data, and three, in cluding BioASQ, RelationExtraction, TextbookQA, contain only the test data.

2 PAPERS • NO BENCHMARKS YET

ODSQA

ODSQA (Open-Domain Spoken Question Answering)

The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.

3 PAPERS • NO BENCHMARKS YET

OK-VQA (Outside Knowledge Visual Question Answering)

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.

263 PAPERS • 2 BENCHMARKS

OPIEC

OPIEC (Open Information Extraction Corpus)

OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It containing more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time.

8 PAPERS • NO BENCHMARKS YET

ORCAS

ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.

16 PAPERS • NO BENCHMARKS YET

ORConvQA

ORConvQA (Open-Retrieval Conversational Question Answering)

Enhances QuAC by adapting it to an open-retrieval setting. It is an aggregation of three existing datasets: (1) the QuAC dataset that offers information-seeking conversations, (2) the CANARD dataset that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions.

15 PAPERS • NO BENCHMARKS YET

ORKG-QA

A preliminary dataset of related tables and a corresponding set of natural language questions.

1 PAPER • NO BENCHMARKS YET

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.

30 PAPERS • 1 BENCHMARK

OneStopQA

OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.

4 PAPERS • NO BENCHMARKS YET

PEYMA

Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.

8 PAPERS • NO BENCHMARKS YET

PathVQA

PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.

33 PAPERS • 1 BENCHMARK

PolicyQA

A dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices.

11 PAPERS • NO BENCHMARKS YET

ProtoQA

ProtoQA is a question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international game show FAMILY- FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.

9 PAPERS • NO BENCHMARKS YET

QED

QED is a linguistically principled framework for explanations in question answering. Given a question and a passage, QED represents an explanation of the answer as a combination of discrete, human-interpretable steps: sentence selection := identification of a sentence implying an answer to the question referential equality := identification of noun phrases in the question and the answer sentence that refer to the same thing predicate entailment := confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away. The QED dataset is an expert-annotated dataset of QED explanations build upon a subset of the Google Natural Questions dataset.

14 PAPERS • 1 BENCHMARK

Quizbowl

Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity such as those found on Wikipedia.

8 PAPERS • 1 BENCHMARK

Qulac

A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.

18 PAPERS • NO BENCHMARKS YET

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other.

53 PAPERS • 8 BENCHMARKS

RAVEN

RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.

80 PAPERS • NO BENCHMARKS YET

RELX

RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.

6 PAPERS • NO BENCHMARKS YET

ReviewQA

ReviewQA is a question-answering dataset based on hotel reviews. The questions of this dataset are linked to a set of relational understanding competencies that a model is expected to master. Indeed, each question comes with an associated type that characterizes the required competency.

2 PAPERS • NO BENCHMARKS YET

RuBQ

RuBQ (Russian Knowledge Base Questions)

The first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.

4 PAPERS • NO BENCHMARKS YET

SARA

SARA (StAtutory Reasoning Assessment)

A dataset for statutory reasoning in tax law entailment and question answering.

12 PAPERS • NO BENCHMARKS YET

SQuAD-it

SQuAD-it is derived from the SQuAD dataset and it is obtained through semi-automatic translation of the SQuAD dataset into Italian. It represents a large-scale dataset for open question answering processes on factoid questions in Italian. The dataset contains more than 60,000 question/answer pairs derived from the original English dataset.

1 PAPER • 1 BENCHMARK

SQuADShifts

Provides four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data.

8 PAPERS • 4 BENCHMARKS

ST-VQA (Scene Text Visual Question Answering)

ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.

77 PAPERS • NO BENCHMARKS YET

SberQuAD (Sberbank Question Answering Dataset)

A large scale analogue of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community.

7 PAPERS • 1 BENCHMARK

Datasets

363 dataset results for Question Answering