360 dataset results for Question Answering

LEAF-QA

LEAF-QA is a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA) and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. Because LEAF-QA is constructed from real-world sources, it requires a novel architecture to enable question answering.

5 PAPERS • NO BENCHMARKS YET

MATINF (Maternal and Infant Dataset)

The Maternal and Infant (MATINF) Dataset is a large-scale Chinese dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby care. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).

5 PAPERS • NO BENCHMARKS YET

ManyModalQA

The data is collected by scraping Wikipedia; crowdsourcing is then used to collect question-answer pairs.

5 PAPERS • NO BENCHMARKS YET

QAConv

QAConv is a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or retrieved from a large conversational pool.

5 PAPERS • NO BENCHMARKS YET

RuCoS

RuCoS (Russian Reading Comprehension with Commonsense Reasoning)

Russian Reading Comprehension with Commonsense Reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of RuCoS is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension.

5 PAPERS • 1 BENCHMARK

TutorialVQA

TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, each consisting of a video, a question, and an answer span, manually collected from screencast tutorial videos with spoken narratives for photo-editing software.

5 PAPERS • NO BENCHMARKS YET

V2C

V2C (Video-to-Commonsense)

Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.

5 PAPERS • NO BENCHMARKS YET

WebCPM

WebCPM is a Chinese LFQA dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.

5 PAPERS • NO BENCHMARKS YET

WikiHowQA

WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.

5 PAPERS • NO BENCHMARKS YET

AIT-QA (Airline Industry Table QA)

AIT-QA is a dataset for Table Question Answering (Table-QA) which is specific to the airline industry. The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings of major airline companies for the fiscal years 2017-2019. It also contains annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms.

4 PAPERS • NO BENCHMARKS YET

ConvRef

ConvRef is a conversational QA benchmark with reformulations. It consists of around 11k natural conversations with about 205k reformulations. ConvRef builds upon the conversational KG-QA benchmark ConvQuestions. Questions come from five different domains (books, movies, music, TV series and soccer), and answers are Wikidata entities. We used conversation sessions in ConvQuestions as input to our user study. Study participants interacted with a baseline QA system that was trained using the available paraphrases in ConvQuestions as proxies for reformulations. Users were shown follow-up questions in a given conversation interactively, one after the other, along with the answer coming from the baseline QA system. For wrong answers, the user was prompted to reformulate the question up to four times if needed. In this way, users were able to pose reformulations based on previous wrong answers and the conversation history.

4 PAPERS • NO BENCHMARKS YET

DiSCQ (Discharge Summary Clinical Questions)

DiSCQ is a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. This dataset is released to facilitate further research into realistic clinical Question Answering (QA) and Question Generation (QG).

4 PAPERS • NO BENCHMARKS YET

IndoNLG

IndoNLG is a benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.

4 PAPERS • NO BENCHMARKS YET

KG20C

KG20C (A scholarly knowledge graph benchmark dataset)

KG20C is a knowledge graph about high-quality papers from 20 top computer science conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering.

4 PAPERS • 1 BENCHMARK

MOCHA

Contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation.

4 PAPERS • NO BENCHMARKS YET

OneStopQA

OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.

4 PAPERS • NO BENCHMARKS YET

PQuAD

PQuAD (Persian Question Answering Dataset)

Persian Question Answering Dataset (PQuAD) is a crowdsourced reading comprehension dataset on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.

4 PAPERS • NO BENCHMARKS YET

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics – across visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23s average length, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or fine tuning regime.

4 PAPERS • NO BENCHMARKS YET

RuBQ

RuBQ (Russian Knowledge Base Questions)

The first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.

4 PAPERS • NO BENCHMARKS YET

AfriQA

AfriQA is a cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages, where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. The dataset enables the development of more equitable QA technology.

3 PAPERS • NO BENCHMARKS YET

CCPE-M

CCPE-M (Coached Conversational Preference Elicitation dataset for Movies)

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.

3 PAPERS • NO BENCHMARKS YET

CHQ-Summ (Consumer Healthcare Question Summarization)

Contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.

3 PAPERS • NO BENCHMARKS YET

COVID-Q

COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.

3 PAPERS • NO BENCHMARKS YET

DICE: a Dataset of Italian Crime Event news

DICE: a Dataset of Italian Crime Event news (from Gazzetta di Modena [2011-2021])

The dataset contains the main components of the news articles published online by the newspaper Gazzetta di Modena (https://gazzettadimodena.gelocal.it/modena): URL of the web page, title, sub-title, text, date of publication, and crime category assigned to each news article by the author.

3 PAPERS • NO BENCHMARKS YET

Disfl-QA

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

3 PAPERS • NO BENCHMARKS YET

ExplainCPE

This is a medical multiple-choice dataset with explanations that can be used to interpret the answer. The data comes from the Chinese Pharmacist Examination. Each piece of data has a question, five options, a gold_answer and a gold_explanation.
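As a rough illustration of the record layout described above, one ExplainCPE item could be modeled as follows. This is a sketch, not the dataset's official schema: only gold_answer and gold_explanation are field names taken from the description, while the class name ExplainCPEItem, the question and options field names, and the string types are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExplainCPEItem:
    """Hypothetical container for one record; not the official schema."""

    question: str          # assumed field name
    options: List[str]     # assumed field name; the description says five options
    gold_answer: str       # field name from the description; type is assumed
    gold_explanation: str  # field name from the description; type is assumed

    def __post_init__(self) -> None:
        # The description states that every item has exactly five options.
        if len(self.options) != 5:
            raise ValueError(f"expected 5 options, got {len(self.options)}")
```

Checking the option count at construction time surfaces malformed records early when loading the data.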

3 PAPERS • NO BENCHMARKS YET

ForecastQA

ForecastQA is a question-answering dataset consisting of 10,392 event forecasting questions, which have been collected and verified via crowdsourcing efforts. The forecasting problem for this dataset is formulated as a restricted-domain, multiple-choice, question-answering (QA) task that simulates the forecasting scenario.

3 PAPERS • NO BENCHMARKS YET

GHOSTS

GHOSTS is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. It is a collection of multiple datasets of prompts, totalling 728 prompts, for which ChatGPT’s output was manually rated by experts.

3 PAPERS • NO BENCHMARKS YET

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.

3 PAPERS • 1 BENCHMARK

LiveQA

A new question answering dataset constructed from play-by-play live broadcasts. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu (https://nba.hupu.com/games) website.

3 PAPERS • NO BENCHMARKS YET

MMED

Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to a heterogeneous data domain, thus summarizing the data with different contributors.

3 PAPERS • NO BENCHMARKS YET

MuLD (Multitask Long Document Benchmark)

MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally there is a range of output lengths from a single word classification label all the way up to an output longer than the input text.

3 PAPERS • 6 BENCHMARKS

ODSQA

ODSQA (Open-Domain Spoken Question Answering)

The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.

3 PAPERS • NO BENCHMARKS YET

PDFVQA

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

3 PAPERS • NO BENCHMARKS YET

PubChemQA

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please describe the molecule. We remove molecules that cannot be processed by RDKit [Landrum et al., 2021] to generate 2D molecular graphs. We also remove texts with fewer than 4 words, and crop descriptions with more than 256 words. Finally, we obtain 325,754 unique molecules and 365,129 molecule-text pairs. On average, each text description contains 17 words.
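The text-length rules in this description can be sketched as below. This is an illustrative approximation, not the authors' preprocessing code: it assumes words are separated by whitespace and that cropping keeps the first 256 words, neither of which the description specifies. The function names and the (molecule_id, text) pair layout are hypothetical, and the RDKit validity check is omitted.

```python
from typing import List, Optional, Tuple

MIN_WORDS = 4    # texts with fewer words are removed (from the description)
MAX_WORDS = 256  # longer descriptions are cropped (from the description)


def clean_description(text: str) -> Optional[str]:
    """Return the cleaned text, or None if it is too short to keep."""
    words = text.split()  # assumption: whitespace tokenization
    if len(words) < MIN_WORDS:
        return None
    # Assumption: cropping keeps the leading MAX_WORDS words.
    return " ".join(words[:MAX_WORDS])


def filter_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Apply clean_description to hypothetical (molecule_id, text) pairs."""
    cleaned = []
    for molecule_id, text in pairs:
        kept = clean_description(text)
        if kept is not None:
            cleaned.append((molecule_id, kept))
    return cleaned
```

For example, filter_pairs would drop a pair whose text is "a ketone" (two words) and keep a five-word description unchanged.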

3 PAPERS • 1 BENCHMARK

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model to fill up multiple blanks in a passage from a shared candidate set with distractors designed by English teachers.

3 PAPERS • 1 BENCHMARK

Shmoop Corpus

Shmoop Corpus is a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, a set of common NLP tasks are constructed, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories.

3 PAPERS • NO BENCHMARKS YET

SpaRTUN

SpaRTUN is a dataset synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).

3 PAPERS • NO BENCHMARKS YET

UIT-ViCoV19QA

The dataset comprises 4,500 question-answer pairs collected from trusted medical sources, with at least one answer and at most four unique paraphrased answers per question.

3 PAPERS • NO BENCHMARKS YET

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists of 4 types of questions with regard to functions, official names, protein families, and sub-cellular locations. We collect a total of 569,516 proteins and 1,891,506 question-answering samples.

3 PAPERS • 1 BENCHMARK

VideoNavQA

The VideoNavQA dataset contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the Embodied Question Answering (EQA) task.

3 PAPERS • NO BENCHMARKS YET

XOR-TYDI QA

A large-scale dataset built on questions from TyDi QA lacking same-language answers.

3 PAPERS • NO BENCHMARKS YET

AWS Documentation

We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the author's interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers and accompanying documents in the dataset are annotated by the authors. There are two types of answers: text and yes-no-none (YNN) answers. Text answers range from a few words to a full paragraph sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none dependin

2 PAPERS • NO BENCHMARKS YET

Allegro Reviews

A comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing datasets for named entity recognition, question-answering, textual entailment, and others.

2 PAPERS • NO BENCHMARKS YET

Almawave-SLU

Almawave-SLU is the first Italian dataset for Spoken Language Understanding (SLU). It is derived through a semi-automatic procedure and is used as a benchmark of various open source and commercial systems.

2 PAPERS • NO BENCHMARKS YET

CLEVR-Math

CLEVR-Math is a multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. These word problems require a combination of language, visual and mathematical reasoning.

2 PAPERS • NO BENCHMARKS YET

CS1QA

CS1QA is a dataset for code-based question answering in the programming education domain. It consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, and 17,698 unannotated chat data with code.

2 PAPERS • NO BENCHMARKS YET

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.

2 PAPERS • 1 BENCHMARK
