General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.
1,188 PAPERS • 27 BENCHMARKS
Microsoft Research Paraphrase Corpus (MRPC) is a corpus consists of 5,801 sentence pairs collected from newswire articles. Each pair is labelled if it is a paraphrase or not by human annotators. The whole set is divided into a training subset (4,076 sentence pairs of which 2,753 are paraphrases) and a test subset (1,725 pairs of which 1,147 are paraphrases).
312 PAPERS • 5 BENCHMARKS
PAWS-X contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
62 PAPERS • 1 BENCHMARK
Paraphrase Adversaries from Word Scrambling (PAWS) is a dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.
55 PAPERS • NO BENCHMARKS YET
WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting documents are from WikiReading. A bipartite graph connecting entities and documents is first built and the answer for each query is located by traversal on this graph. Candidates that are type-consistent with the answer and share the same relation in query with the answer are included, resulting in a set of candidates. Thus, WikiHop is a multi-choice style reading comprehension data set. There are totally about 43K samples in training set, 5K samples in development set and 2.5K samples in test set. The test set is not provided. The task is to predict the correct answer given a query and multiple supporting documents.
47 PAPERS • 2 BENCHMARKS
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other.
40 PAPERS • 5 BENCHMARKS
Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.
20 PAPERS • 1 BENCHMARK
Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.
18 PAPERS • 1 BENCHMARK
FLUE is a French Language Understanding Evaluation benchmark. It consists of 5 tasks: Text Classification, Paraphrasing, Natural Language Inference, Constituency Parsing and Part-of-Speech Tagging, and Word Sense Disambiguation.
9 PAPERS • NO BENCHMARKS YET
Twitter News URL Corpus is a human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification.
4 PAPERS • 1 BENCHMARK
PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge.
3 PAPERS • NO BENCHMARKS YET
Finnish Paraphrase Corpus is a fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in the corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts.
2 PAPERS • NO BENCHMARKS YET
This is a paraphrasing dataset created using the adversarial paradigm. A task was designed called the Adversarial Paraphrasing Task (APT) whose objective was to write sentences that mean the same as a given sentence but have as different syntactical and lexical properties as possible.
1 PAPER • 1 BENCHMARK
This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.
1 PAPER • NO BENCHMARKS YET