General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.
1,188 PAPERS • 27 BENCHMARKS
Microsoft Research Paraphrase Corpus (MRPC) is a corpus consists of 5,801 sentence pairs collected from newswire articles. Each pair is labelled if it is a paraphrase or not by human annotators. The whole set is divided into a training subset (4,076 sentence pairs of which 2,753 are paraphrases) and a test subset (1,725 pairs of which 1,147 are paraphrases).
312 PAPERS • 5 BENCHMARKS
The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena. Each pair of sentences is annotated in two dimensions: relatedness and entailment. The relatedness score ranges from 1 to 5, and Pearson’s r is used for evaluation; the entailment relation is categorical, consisting of entailment, contradiction, and neutral. There are 4439 pairs in the train split, 495 in the trial split used for development and 4906 in the test split. The sentence pairs are generated from image and video caption datasets before being paired up using some algorithm.
164 PAPERS • 3 BENCHMARKS
SentEval is a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference and sentence similarity. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. The toolkit comes with scripts to download and preprocess datasets, and an easy interface to evaluate sentence encoders.
89 PAPERS • 1 BENCHMARK
STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets include text from image captions, news headlines and user forums.
28 PAPERS • 1 BENCHMARK
EVALution dataset is evenly distributed among the three classes (hypernyms, co-hyponyms and random) and involves three types of parts of speech (noun, verb, adjective). The full dataset contains a total of 4,263 distinct terms consisting of 2,380 nouns, 958 verbs and 972 adjectives.
24 PAPERS • NO BENCHMARKS YET
STS-2014 is from SemEval-2014, constructed from image descriptions, news headlines, tweet news, discussion forums, and OntoNotes.
23 PAPERS • NO BENCHMARKS YET
Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.
18 PAPERS • 1 BENCHMARK
CARER is an emotion dataset collected through noisy labels, annotated via distant supervision as in (Go et al., 2009). It uses the eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The hashtags (339 total) serve as noisy labels, which allow annotation via distant supervision as in (Go et al., 2009).
15 PAPERS • 3 BENCHMARKS
Publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists.
12 PAPERS • NO BENCHMARKS YET
KorNLI is a Korean Natural Language Inference (NLI) dataset. The dataset is constructed by automatically translating the training sets of the SNLI, XNLI and MNLI datasets. To ensure translation quality, two professional translators with at least seven years of experience who specialize in academic papers/books as well as business contracts post-edited a half of the dataset each and cross-checked each other’s translation afterward. It contains 942,854 training examples translated automatically and 7,500 evaluation (development and test) examples translated manually
11 PAPERS • NO BENCHMARKS YET
Semantic Textual Similarity (2012 - 2016) involves a set of semantic textual similarity datasets that were part of previous shared tasks (2012-2016):
9 PAPERS • 5 BENCHMARKS
PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.
7 PAPERS • NO BENCHMARKS YET
Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations including positive and negative associations between image pairs, caption pairs and image-caption pairs.
6 PAPERS • 3 BENCHMARKS