The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
10 PAPERS • 1 BENCHMARK
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
1 PAPER • NO BENCHMARKS YET
The HANS (Heuristic Analysis for NLI Systems) dataset is an NLI evaluation set containing many examples on which shallow syntactic heuristics, such as assuming that high lexical overlap between premise and hypothesis implies entailment, fail.
1 PAPER • 1 BENCHMARK
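The kind of heuristic failure HANS targets can be illustrated with a constructed example in the dataset's premise/hypothesis/label format (this pair is illustrative, not drawn verbatim from HANS):

```python
# A constructed HANS-style example: the hypothesis is built entirely from
# words in the premise (lexical overlap), yet the premise does not entail it.
example = {
    "premise": "The doctor near the lawyer saw the actor.",
    "hypothesis": "The lawyer saw the actor.",
    "label": "non-entailment",  # it was the doctor, not the lawyer, who saw the actor
}

def lexical_overlap_heuristic(premise, hypothesis):
    """Naive heuristic: call any high-overlap pair an entailment."""
    p_words = set(premise.lower().replace(".", "").split())
    h_words = set(hypothesis.lower().replace(".", "").split())
    return "entailment" if h_words <= p_words else "non-entailment"

prediction = lexical_overlap_heuristic(example["premise"], example["hypothesis"])
# The heuristic predicts "entailment", but the gold label is "non-entailment".
```

A model that has internalized this heuristic systematically fails on such HANS examples.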
The HELP dataset is an automatically created natural language inference (NLI) dataset that embodies the combination of lexical and logical inferences focusing on monotonicity (i.e., phrase replacement-based reasoning). The HELP (Ver.1.0) has 36K inference pairs consisting of upward monotone, downward monotone, non-monotone, conjunction, and disjunction.
28 PAPERS • 1 BENCHMARK
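The monotonicity patterns HELP covers can be sketched with a minimal phrase-replacement illustration (a hypothetical helper, not the dataset's actual generation pipeline):

```python
def replace_phrase(sentence, old, new):
    """Form a candidate hypothesis by substituting one phrase."""
    return sentence.replace(old, new)

# Downward-monotone context: "no" licenses replacement with a MORE specific phrase.
premise_down = "No dogs entered the park"
hypothesis_down = replace_phrase(premise_down, "dogs", "small dogs")
# "No dogs entered the park" entails "No small dogs entered the park".

# Upward-monotone context: "some" licenses replacement with a LESS specific phrase.
premise_up = "Some small dogs entered the park"
hypothesis_up = replace_phrase(premise_up, "small dogs", "dogs")
# "Some small dogs entered the park" entails "Some dogs entered the park".
```

Reversing the direction of replacement in either context yields a non-entailed hypothesis, which is what makes monotonicity a useful probe of logical inference.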
HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams used to obtain specialized positions in the Spanish healthcare system, and are challenging even for highly specialized humans.
20 PAPERS • NO BENCHMARKS YET
An IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of >25k semiautomatically generated sentence pairs illustrating well-studied pragmatic inference types.
5 PAPERS • NO BENCHMARKS YET
InfoTabS comprises human-written textual hypotheses based on premises that are tables extracted from Wikipedia infoboxes.
13 PAPERS • NO BENCHMARKS YET
KorNLI is a Korean Natural Language Inference (NLI) dataset. The dataset is constructed by automatically translating the training sets of the SNLI, XNLI and MNLI datasets. To ensure translation quality, two professional translators with at least seven years of experience, who specialize in academic papers/books as well as business contracts, post-edited half of the dataset each and cross-checked each other's translation afterward. It contains 942,854 training examples translated automatically and 7,500 evaluation (development and test) examples translated manually.
18 PAPERS • NO BENCHMARKS YET
KorSTS is a dataset for semantic textual similarity (STS) in Korean. The dataset is constructed by automatically translating the STS-B dataset. To ensure translation quality, two professional translators with at least seven years of experience, who specialize in academic papers/books as well as business contracts, post-edited half of the dataset each and cross-checked each other's translation afterward. The KorSTS dataset comprises 5,749 training examples translated automatically and 2,879 evaluation examples translated manually.
MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and collected from linguistics publications. The dataset was constructed by collecting naturally-occurring examples by crowdsourcing and well-designed ones from linguistics publications. It consists of 5,382 examples.
18 PAPERS • 1 BENCHMARK
Microsoft Research Paraphrase Corpus (MRPC) is a corpus of 5,801 sentence pairs collected from newswire articles. Each pair is labelled by human annotators to indicate whether the two sentences are paraphrases. The whole set is divided into a training subset (4,076 sentence pairs, of which 2,753 are paraphrases) and a test subset (1,725 pairs, of which 1,147 are paraphrases).
699 PAPERS • 5 BENCHMARKS
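The split statistics quoted above are internally consistent, as a quick check shows; note the skew toward the positive (paraphrase) class:

```python
# MRPC split sizes as reported above.
train_total, train_paraphrase = 4076, 2753
test_total, test_paraphrase = 1725, 1147

assert train_total + test_total == 5801  # matches the stated corpus size

# Roughly two thirds of each split are paraphrases, so a majority-class
# baseline already achieves ~67% accuracy on this dataset.
train_pos_rate = train_paraphrase / train_total   # ~0.675
test_pos_rate = test_paraphrase / test_total      # ~0.665
```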
The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical notes, annotated as Definitely True, Maybe True or Definitely False. The dataset contains 11,232 training, 1,395 development and 1,422 test instances. This provides a natural language inference (NLI) task grounded in the medical history of patients.
8 PAPERS • 2 BENCHMARKS
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
23 PAPERS • NO BENCHMARKS YET
Natural Language Inference in Turkish (NLI-TR) provides Turkish translations of two large English NLI datasets; a team of experts validated their translation quality and fidelity to the original labels.
2 PAPERS • NO BENCHMARKS YET
NewsPH-NLI is a sentence entailment benchmark dataset in the low-resource Filipino language.
Pars-ABSA is a manually annotated Persian dataset for aspect-based sentiment analysis, verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews.
3 PAPERS • NO BENCHMARKS YET
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, each annotated with a binary value indicating whether the two questions are paraphrases of each other.
53 PAPERS • 8 BENCHMARKS
The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and RTE5 is combined. Examples are constructed based on news and Wikipedia text.
52 PAPERS • 1 BENCHMARK
RuSentRel is a corpus of analytical articles in the domain of international politics, obtained from authoritative foreign sources and translated into Russian. The collected articles contain both the author's opinion on the subject matter of the article and a large number of sentiment relations between the participants of the described situations. In total, 73 large analytical texts were labeled with about 2,000 relations.
Visual Entailment (VE) consists of image-sentence pairs in which the premise is an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. SNLI-VE is a dataset for VE which is based on the Stanford Natural Language Inference corpus and the Flickr30k dataset.
109 PAPERS • 2 BENCHMARKS
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. Relevant text is retrieved from a large corpus of web sentences and used as the premise P; each premise-hypothesis pair is then annotated via crowdsourcing as supports (entails) or not (neutral). The dataset contains 27,026 examples: 10,101 labelled entails and 16,925 labelled neutral.
9 PAPERS • 1 BENCHMARK
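The question-to-hypothesis conversion step can be sketched with a naive heuristic (hypothetical; SciTail's actual converter is more sophisticated than wh-word substitution):

```python
def to_hypothesis(question, answer):
    """Turn a question plus its correct answer choice into an assertive
    statement by substituting the answer for the wh-word (naive sketch)."""
    q = question.rstrip("?")
    for wh in ("What", "Which", "Who"):
        if q.startswith(wh + " "):
            return q.replace(wh, answer, 1) + "."
    # Fallback for question shapes the heuristic does not handle.
    return f"{q}: {answer}."

hypothesis = to_hypothesis("What is the best conductor of electricity?", "Silver")
# "Silver is the best conductor of electricity."
```

The resulting assertion is then paired with a retrieved web-sentence premise for annotation.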
SelQA is a dataset that consists of questions generated through crowdsourcing and sentence-length answers that are drawn from the ten most prevalent topics in the English Wikipedia.
9 PAPERS • NO BENCHMARKS YET
Story Commonsense is a large-scale dataset with rich low-level annotations of the mental states (motivations and emotions) of characters in short stories, and establishes baseline performance on several new tasks.
12 PAPERS • NO BENCHMARKS YET
Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
16 PAPERS • 1 BENCHMARK
The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of sentences, and the task is to determine whether the second sentence is an entailment of the first one or not. The dataset is used to train and evaluate models on their ability to understand these relationships between sentences.
14 PAPERS • 1 BENCHMARK
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
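The classification task WiLI-2018 defines can be approached with a classic character n-gram profile baseline; a minimal sketch follows (the two toy profiles stand in for per-language training data, and are not from the dataset):

```python
from collections import Counter
import math

def ngrams(text, n=3):
    """Character n-gram counts, with padding so word boundaries count."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Toy language profiles; a real system would build these from the training split.
profiles = {
    "eng": ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "deu": ngrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

def identify(paragraph):
    """Assign the language whose n-gram profile is most similar."""
    query = ngrams(paragraph)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))
```

With 235 languages the same scheme scales by adding one profile per language; stronger baselines replace cosine similarity with a trained classifier over the same features.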
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
341 PAPERS • 1 BENCHMARK
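The core idea behind AfLite's bias reduction, scoring instances by how easily a simple model predicts them from embeddings alone, can be sketched as follows (a hypothetical simplification of the published algorithm, using a least-squares linear probe):

```python
import numpy as np

def predictability_scores(X, y, n_rounds=50, train_frac=0.5, seed=0):
    """AfLite-style scoring (simplified sketch): each instance is scored by
    how often a linear probe, trained on a random subset of the data,
    labels it correctly when it is held out. High-scoring instances carry
    machine-detectable associations and are candidates for removal."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hits = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_rounds):
        mask = rng.random(n) < train_frac
        if not mask.any() or mask.all():
            continue
        # Least-squares linear probe on {-1, +1} targets.
        w, *_ = np.linalg.lstsq(X[mask], 2.0 * y[mask] - 1.0, rcond=None)
        pred = X[~mask] @ w > 0
        hits[~mask] += pred == y[~mask].astype(bool)
        counts[~mask] += 1
    return hits / np.maximum(counts, 1)
```

Filtering then repeatedly drops the top-scoring slice and re-scores, so that only instances a linear model cannot exploit remain in the dataset.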
e-SNLI extends the SNLI dataset with human-annotated natural language explanations of the entailment labels. It is used for various goals, such as obtaining full-sentence justifications of a model's decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets.
123 PAPERS • 1 BENCHMARK
esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.