🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

360 dataset results for Question Answering

ClarQ, consists of ∼2M examples distributed across 173 domains of stackexchange. This dataset is meant for training and evaluation of Clarification Question Generation Systems.

2 PAPERS • NO BENCHMARKS YET

ConcurrentQA Benchmark

ConcurrentQA is a textual multi-hop QA benchmark to require concurrent retrieval over multiple data-distributions (i.e. Wikipedia and email data). The dataset follow the exact same schema and design as HotpotQA. The data set is downloadable here: https://github.com/facebookresearch/concurrentqa. It also contains model and result analysis code. This benchmark can also be used to study privacy when reasoning over data distributed in multiple privacy scopes --- i.e. Wikipedia in the public domain and emails in the private domain.

2 PAPERS • 1 BENCHMARK

DrugEHRQA

DrugEHRQA (Electronic Health Record QA)

Contains over 70,000 question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR).

2 PAPERS • NO BENCHMARKS YET

ExpMRC

ExpMRC is a benchmark for the Explainability evaluation of Machine Reading Comprehension. ExpMRC contains four subsets of popular MRC datasets with additionally annotated evidences, including SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice questions MRC tasks in both English and Chinese.

2 PAPERS • 4 BENCHMARKS

G-VUE (General-purpose Visual Understanding Evaluation)

General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.

2 PAPERS • NO BENCHMARKS YET

MIMIC-SPARQL

Question Answering (QA) is a widely-used framework for developing and evaluating an intelligent machine. In this light, QA on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone toward developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA.

2 PAPERS • NO BENCHMARKS YET

MilkQA

A question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service.

2 PAPERS • NO BENCHMARKS YET

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.

2 PAPERS • 1 BENCHMARK

MultiReQA

MultiReQA is a cross-domain evaluation for retrieval question answering models. Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus. MultiReQA is a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets from the MRQA shared task. MultiReQA contains the sentence boundary annotation from eight publicly available QA datasets including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, and TextbookQA. Five of these datasets, including SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, contain both training and test data, and three, in cluding BioASQ, RelationExtraction, TextbookQA, contain only the test data.

2 PAPERS • NO BENCHMARKS YET

NQuAD (Nuclear Question Answering Dataset)

NQuAD is a Nuclear Question Answering Dataset, which contains 700+ nuclear Question Answer pairs developed and verified by expert nuclear researchers.

2 PAPERS • NO BENCHMARKS YET

ResQ (Real-world Spatial Question Answering)

ReSQ is a real-world Spatial Question Answering dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations.

2 PAPERS • NO BENCHMARKS YET

ReviewQA

ReviewQA is a question-answering dataset based on hotel reviews. The questions of this dataset are linked to a set of relational understanding competencies that a model is expected to master. Indeed, each question comes with an associated type that characterizes the required competency.

2 PAPERS • NO BENCHMARKS YET

RoMQA

RoMQA is a benchmark for robust, multi-evidence, and multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. The dataset evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster.

2 PAPERS • NO BENCHMARKS YET

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.

2 PAPERS • 1 BENCHMARK

SimpleDBpediaQA

A new benchmark dataset for simple question answering over knowledge graphs that was created by mapping SimpleQuestions entities and predicates from Freebase to DBpedia.

2 PAPERS • NO BENCHMARKS YET

Stanford Schema2QA Dataset

Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains: restaurants, hotels, people, movies, books, and music, based on crawled Schema.org metadata from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm.). In total, there are over 2,000,000 examples for training, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with executable virtual assistant programming language ThingTalk.

2 PAPERS • NO BENCHMARKS YET

Super-CLEVR

Super-CLEVR is a dataset for Visual Question Answering (VQA) where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. It contains 21 vehicle models belonging to 5 categories, with controllable attributes. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality.

2 PAPERS • NO BENCHMARKS YET

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

2 PAPERS • NO BENCHMARKS YET

Visual Beliefs

Visual Beliefs is a dataset of abstract scenes to study visual beliefs. The dataset consists of 8-frame scenes, and in each scene a person has a mistaken belief. The dataset can be used for two tasks: predicting who is mistaken and predicting when are they mistaken.

2 PAPERS • NO BENCHMARKS YET

VizWiz-Priv

VizWiz-Priv (Visual Privacy dataset)

VizWiz-Priv includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content.

2 PAPERS • NO BENCHMARKS YET

VizWiz-QualityIssues

A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.

2 PAPERS • NO BENCHMARKS YET

X-WikiRE

X-WikiRE is a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations.

2 PAPERS • NO BENCHMARKS YET

AviationQA

AviationQA is introduced in the paper titled- There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering

1 PAPER • 1 BENCHMARK

BDD-QA

BDD-QA is distinguished by its encompassing range of traffic actions, crafted to rigorously evaluate a model's decision-making abilities in traffic scenario. This makes it a potent tool for high-level decision-making research within traffic contexts, including autonomous driving developments.

1 PAPER • NO BENCHMARKS YET

BIPIA

BIPIA (Benchmark of Indirect Prompt Injection Attacks)

Recent advancements in large language models (LLMs) have led to their adoption across various applications, notably in combining LLMs with external content to generate responses. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise LLM's output, causing their responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs is available due to the lack of a benchmark. Furthermore, no effective defense has been proposed. We introduce the first benchmark of indirect prompt injection attack, BIPIA, to measure the robustness of various LLMs and defenses against indirect prompt injection attacks. We hope that our benchmark and defenses can inspire future work in this important area.

1 PAPER • NO BENCHMARKS YET

CC-Riddle (Chinese Character Riddle)

CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters by crawling riddles from the Web and generating brand new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, decomposition and explanation of the solution character for the generation model and get multiple riddle descriptions for each tested character. Then the generated riddles are manually filtered and the final dataset, CCRiddle is composed of both human-written riddles and filtered generated riddle.

1 PAPER • NO BENCHMARKS YET

COPA-HR

The COPA-HR dataset (Choice of plausible alternatives in Croatian) is a translation of the English COPA dataset by following the XCOPA dataset translation methodology. The dataset consists of 1000 premises (My body cast a shadow over the grass), each given a question (What is the cause?), and two choices (The sun was rising; The grass was cut), with a label encoding which of the choices is more plausible given the annotator or translator (The sun was rising).

1 PAPER • NO BENCHMARKS YET

CUHK-QA

CUHK-QA is a dataset for natural language-based person search using iterative questioning.

1 PAPER • NO BENCHMARKS YET

ChAII - Hindi and Tamil Question Answering

The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators.

1 PAPER • 1 BENCHMARK

ChiQA (Chinese VQA)

ChiQA is a dataset designed for visual question answering tasks that not only measures the relatedness but also measures the answerability, which demands more fine-grained vision and language reasoning. It contains more than 40K questions and more than 200K question-images pairs. The questions are real-world image-independent queries that are more various and unbiased.

1 PAPER • NO BENCHMARKS YET

CompMix

CompMix is a crowdsourced QA benchmark which naturally demands the integration of a mixture of input sources. CompMix has a total of 9,410 questions, and features several complex intents like joins and temporal conditions.

1 PAPER • NO BENCHMARKS YET

CoreSearch

CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mention, and (2) an annotated collection of destructor passages.

1 PAPER • NO BENCHMARKS YET

Dialog-based Language Learning dataset

Dialog-based Language Learning dataset is designed to measure how well models can perform at learning as a student given a teacher’s textual responses to the student’s answer (as well as potentially receiving an external real-valued reward signal).

1 PAPER • NO BENCHMARKS YET

EpiK-Eval (Epistemic Knowledge Evaluation)

Benchmark to evaluate the capability of LMs to consolidate and recall information from multiple training documents.

1 PAPER • NO BENCHMARKS YET

Event-QA

Contains 1000 semantic queries and the corresponding English, German and Portuguese verbalizations for EventKG - an event-centric knowledge graph with more than 970 thousand events.

1 PAPER • NO BENCHMARKS YET

FanOutQA

FanOutQA is a high quality, multi-hop, multi-document benchmark for large language models using English Wikipedia as its knowledge base. Compared to other question-answering benchmarks, FanOutQA requires reasoning over a greater number of documents, with the benchmark's main focus being on the titular fan-out style of question. We present these questions in three tasks -- closed-book, open-book, and evidence-provided -- which measure different abilities of LLM systems.

1 PAPER • NO BENCHMARKS YET

Financial Language Understanding Evaluation

Financial Language Understanding Evaluation is an open-source comprehensive suite of benchmarks for the financial domain. It contains benchmarks across 5 NLP tasks in financial domain as well as common benchmarks used in the previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection and question answering.

1 PAPER • NO BENCHMARKS YET

LLeQA (Long-form Legal Question Answering)

LLeQA is a French native dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.

1 PAPER • NO BENCHMARKS YET

MLQuestions

MLQuestions is a domain-adaptation dataset for the machine learning domain containing 50K unaligned passages and 35K unaligned questions, and 3K aligned passage and question pairs.

1 PAPER • NO BENCHMARKS YET

NText

NText is an eight million words dataset extracted and preprocessed from nuclear research papers and thesis.

1 PAPER • NO BENCHMARKS YET

ORKG-QA

A preliminary dataset of related tables and a corresponding set of natural language questions.

1 PAPER • NO BENCHMARKS YET

PQ-decaNLP (Paraphrase Questions - decaNLP)

Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark where question answering is used to frame 10 natural language understanding tasks in a single model. PQ-decaNLP is a crowd-sourced corpus of paraphrased questions, annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping the class labels and changing the sentence modality lead to a large performance degradation.

1 PAPER • NO BENCHMARKS YET

PersianQA (Persian Question Answering Dataset)

PersianQA: a dataset for Persian Question Answering Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced the dataset consists of more than 9,000 entries. Each entry can be either an impossible-to-answer or a question with one or more answers spanning in the passage (the context) from which the questioner proposed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".

1 PAPER • NO BENCHMARKS YET

Phrase-in-Context

Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. PiC benchmark is distributed under CC-BY-NC 4.0.

1 PAPER • NO BENCHMARKS YET

Pirá

Pirá (Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean)

A large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English. Pirá is a crowdsourced question answering (QA) dataset on the ocean and the Brazilian coast designed for reading comprehension.

1 PAPER • NO BENCHMARKS YET

PoseScript

PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.

1 PAPER • NO BENCHMARKS YET

QALD-9-Plus

QALD-9-Plus Dataset Description QALD-9-Plus is the dataset for Knowledge Graph Question Answering (KGQA) based on well-known QALD-9.

1 PAPER • 1 BENCHMARK

QASiNa (Question Answering Sirah Nabawiyah)

Question Answering Sirah Nabawiyah (QASiNa) Dataset is a reading comprehension dataset consists of QA from Sirah Nabawiyah literature in Indonesian Language

1 PAPER • NO BENCHMARKS YET

Datasets

360 dataset results for Question Answering