LEAF-QA, a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA), and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. LEAF-QA being constructed from real-world sources, requires a novel architecture to enable question answering.
5 PAPERS • NO BENCHMARKS YET
Maternal and Infant (MATINF) Dataset is a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
Collects the data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
QAConv is a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations including business emails, panel discussions, and work channels. Unlike opendomain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or retrieved from a large conversational pool.
Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of RuCoS is to evaluate a machine`s ability of commonsense reasoning in reading comprehension.
5 PAPERS • 1 BENCHMARK
TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, comprised of videos, questions, and answer spans manually collected from screencast tutorial videos with spoken narratives for a photo-editing software.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
WebCPM is a Chinese LFQA dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
AIT-QA is a dataset for Table Question Answering (Table-QA) which is specific to the airline industry. The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings of major airline companies for the fiscal years 2017-2019. It also contains annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms.
4 PAPERS • NO BENCHMARKS YET
ConvRef is a conversational QA benchmark with reformulations. It consists of around 11k natural conversations with about 205k reformulations. ConvRef builds upon the conversational KG-QA benchmark ConvQuestions. Questions come from five different domains: books, movies, music, TV series and soccer and answers are Wikidata entities. We used conversation sessions in ConvQuestions as input to our user study. Study participants interacted with a baseline QA system, that was trained using the available paraphrases in ConvQuestions as proxies for reformulations. Users were shown follow-up questions in a given conversation interactively, one after the other, along with the answer coming from the baseline QA system. For wrong answers, the user was prompted to reformulate the question up to four times if needed. In this way, users were able to pose reformulations based on previous wrong answers and the conversation history.
DiSCQ is a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. This dataset is released to facilitate further research into realistic clinical Question Answering (QA) and Question Generation (QG).
IndoNLG is a benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.
KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering .
4 PAPERS • 1 BENCHMARK
Contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation.
OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.
Persian Question Answering Dataset (PQuAD) is a crowdsourced reading comprehension dataset on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.
Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics – across visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23s average length, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or fine tuning regime.
The first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.
AfriQA is a cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages, where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. The dataset enables the development of more equitable QA technology.
3 PAPERS • NO BENCHMARKS YET
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
Contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.
The dataset contains the main components of the news articles published online by the newspaper named <a href="https://gazzettadimodena.gelocal.it/modena">Gazzetta di Modena</a>: url of the web page, title, sub-title, text, date of publication, crime category assigned to each news article by the author.
Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.
This is a medical multiple-choice dataset with explanations which can be used to interpret the answer. The data comes from Chinese Pharmacist Examination. Each piece of data has a question, five options, a gold_answer and a gold_explanation.
ForecastQA is a question-answering dataset consisting of 10,392 event forecasting questions, which have been collected and verified via crowdsourcing efforts. The forecasting problem for this dataset is formulated as a restricted-domain, multiple-choice, question-answering (QA) task that simulates the forecasting scenario.
GHOSTS is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. It a collection of multiple datasets of prompts, totalling 728 prompts, for which ChatGPT’s output was manually rated by experts.
JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.
3 PAPERS • 1 BENCHMARK
A new question answering dataset constructed from play-by-play live broadcast. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu (https://nba.hupu.com/games) website.
Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News.) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to heterogeneous data domain, thus summarizing the data with different contributors.
MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally there is a range of output lengths from a single word classification label all the way up to an output longer than the input text.
3 PAPERS • 6 BENCHMARKS
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
PDFVQA: A New Dataset for Real-World VQA on PDF Documents
PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please describe the molecule. We remove molecules that cannot be processed by RDKit [Landrum et al., 2021] to generate 2D molecular graphs. We also remove texts with less than 4 words, and crops descriptions with more than 256 words. Finally, we obtain 325, 754 unique molecules and 365, 129 molecule-text pairs. On average, each text description contains 17 words.
SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model to fill up multiple blanks in a passage from a shared candidate set with distractors designed by English teachers.
Shmoop Corpus is a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, a set of common NLP tasks are constructed, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories.
SpaRTUN a dataset synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).
The dataset comprises 4,500 question-answer pairs collected from trusted medical sources, with at least one answer and at most four unique paraphrased answers per question
UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists 4 types of questions with regard to functions, official names, protein families, and sub-cellular locations. We collect a total of 569, 516 proteins and 1, 891, 506 question-answering samples.
The VideoNavQA dataset contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the Embodied Question Answering (EQA) task.
A large-scale dataset built on questions from TyDi QA lacking same-language answers.
We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the author's interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual and unambiguous answer within the accompanying documents, we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers and accompanying documents in the dataset are annotated by authors. There are two types of answers: text and yes-no-none(YNN) answers. Text answers range from a few words to a full paragraph sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none(YNN) answers can be yes, no, or none dependin
2 PAPERS • NO BENCHMARKS YET
A comprehensive multi-task benchmark for the Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing datasets for named entity recognition, question-answering, textual entailment, and others.
Almawave-SLU is the first Italian dataset for Spoken Language Understanding (SLU). It is derived through a semi-automatic procedure and is used as a benchmark of various open source and commercial systems.
CLEVR-Math is a multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. These word problems requires a combination of language, visual and mathematical reasoning.
CS1QA is a dataset for code-based question answering in the programming education domain. It consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, and 17,698 unannotated chat data with code.
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.
2 PAPERS • 1 BENCHMARK