A platform for research in embodied artificial intelligence (AI).
14 PAPERS • NO BENCHMARKS YET
HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.
20 PAPERS • NO BENCHMARKS YET
A rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et.al.)
11 PAPERS • NO BENCHMARKS YET
A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and multiple free-form corpora linked with the entities in the table. The questions are designed to aggregate both tabular information and text information, i.e., lack of either form would render the question unanswerable.
55 PAPERS • 1 BENCHMARK
InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.
38 PAPERS • NO BENCHMARKS YET
A large-scale dataset for Complex KBQA.
17 PAPERS • 1 BENCHMARK
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.
8 PAPERS • 1 BENCHMARK
LAnguage Model Analysis (LAMA) consists of a set of knowledge sources, each comprised of a set of facts. LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
205 PAPERS • NO BENCHMARKS YET
A challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool.
10 PAPERS • NO BENCHMARKS YET
A new question answering dataset constructed from play-by-play live broadcast. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu (https://nba.hupu.com/games) website.
4 PAPERS • NO BENCHMARKS YET
The first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches.
13 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News.) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to heterogeneous data domain, thus summarizing the data with different contributors.
3 PAPERS • NO BENCHMARKS YET
Contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation.
The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems.
113 PAPERS • 1 BENCHMARK
Collects the data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
5 PAPERS • NO BENCHMARKS YET
MathQA significantly enhances the AQuA dataset with fully-specified operational programs.
97 PAPERS • 1 BENCHMARK
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
28 PAPERS • 1 BENCHMARK
A question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service.
2 PAPERS • NO BENCHMARKS YET
A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances.
27 PAPERS • 2 BENCHMARKS
A quantitative benchmark for developing and understanding video of fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.
6 PAPERS • NO BENCHMARKS YET
Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions.
12 PAPERS • NO BENCHMARKS YET
MultiReQA is a cross-domain evaluation for retrieval question answering models. Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus. MultiReQA is a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets from the MRQA shared task. MultiReQA contains the sentence boundary annotation from eight publicly available QA datasets including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, and TextbookQA. Five of these datasets, including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, contain both training and test data, and three, in cluding BioASQ, RelationExtraction, TextbookQA, contain only the test data.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.
263 PAPERS • 2 BENCHMARKS
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It containing more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time.
8 PAPERS • NO BENCHMARKS YET
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
16 PAPERS • NO BENCHMARKS YET
Enhances QuAC by adapting it to an open-retrieval setting. It is an aggregation of three existing datasets: (1) the QuAC dataset that offers information-seeking conversations, (2) the CANARD dataset that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions.
15 PAPERS • NO BENCHMARKS YET
A preliminary dataset of related tables and a corresponding set of natural language questions.
1 PAPER • NO BENCHMARKS YET
The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.
30 PAPERS • 1 BENCHMARK
OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.
33 PAPERS • 1 BENCHMARK
A dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices.
ProtoQA is a question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international game show FAMILY- FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.
9 PAPERS • NO BENCHMARKS YET
QED is a linguistically principled framework for explanations in question answering. Given a question and a passage, QED represents an explanation of the answer as a combination of discrete, human-interpretable steps: sentence selection := identification of a sentence implying an answer to the question referential equality := identification of noun phrases in the question and the answer sentence that refer to the same thing predicate entailment := confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away. The QED dataset is an expert-annotated dataset of QED explanations build upon a subset of the Google Natural Questions dataset.
14 PAPERS • 1 BENCHMARK
Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity such as those found on Wikipedia.
A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.
18 PAPERS • NO BENCHMARKS YET
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other.
53 PAPERS • 8 BENCHMARKS
RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.
80 PAPERS • NO BENCHMARKS YET
RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.
ReviewQA is a question-answering dataset based on hotel reviews. The questions of this dataset are linked to a set of relational understanding competencies that a model is expected to master. Indeed, each question comes with an associated type that characterizes the required competency.
The first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.
A dataset for statutory reasoning in tax law entailment and question answering.
SQuAD-it is derived from the SQuAD dataset and it is obtained through semi-automatic translation of the SQuAD dataset into Italian. It represents a large-scale dataset for open question answering processes on factoid questions in Italian. The dataset contains more than 60,000 question/answer pairs derived from the original English dataset.
1 PAPER • 1 BENCHMARK
Provides four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data.
8 PAPERS • 4 BENCHMARKS
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
77 PAPERS • NO BENCHMARKS YET
A large scale analogue of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community.
7 PAPERS • 1 BENCHMARK