The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7 released in 2020 consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.
474 PAPERS • 8 BENCHMARKS
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
36 PAPERS • NO BENCHMARKS YET
WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.
30 PAPERS • 1 BENCHMARK