The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages. The English training set was machine translated for all languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples.
330 PAPERS • 10 BENCHMARKS
The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena. Each pair of sentences is annotated in two dimensions: relatedness and entailment. The relatedness score ranges from 1 to 5, and Pearson’s r is used for evaluation; the entailment relation is categorical, consisting of entailment, contradiction, and neutral. There are 4439 pairs in the train split, 495 in the trial split used for development and 4906 in the test split. The sentence pairs are generated from image and video caption datasets before being paired up using some algorithm.
322 PAPERS • 4 BENCHMARKS
OCNLI stands for Original Chinese Natural Language Inference. It is corpus for Chinese Natural Language Inference, collected following closely the procedures of MNLI, but with enhanced strategies aiming for more challenging inference pairs. No human/machine translation is used in creating the dataset, and thus the Chinese texts are original and not translated.
40 PAPERS • 3 BENCHMARKS
XGLUE is an evaluation benchmark XGLUE,which is composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Comparing to its concurrent work XTREME, XGLUE has two characteristics: First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversities of languages, tasks and task origin provide a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan
20 PAPERS • 2 BENCHMARKS
KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for the KUAKE-QQR task. Similar to KUAKE-QTR, the task aims to estimate query-query relevance, which is an essential and challenging task in real-world search engines.
6 PAPERS • 1 BENCHMARK
KUAKE Query Title Relevance, a dataset used to estimate the relevance of the title of a query document, is used for the KUAKE-QTR task. Given a query (e.g., “Symptoms of vitamin B deficiency”), the task aims to find the relevant title (e.g., “The main manifestations of vitamin B deficiency”).
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
3 PAPERS • 1 BENCHMARK