OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a large multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The version of the dataset used to train multilingual models such as BART comprises 138 GB of text.
30 PAPERS • NO BENCHMARKS YET
The MULTEXT-East resources are a multilingual dataset for language engineering research and development. They consist of (1) the MULTEXT-East morphosyntactic specifications, which define categories (parts of speech), their morphosyntactic features (attributes and values), and the compact MSD tagset representations; (2) morphosyntactic lexica; (3) the annotated parallel "1984" corpus; and (4) some comparable text and speech corpora. The specifications are available for the following macrolanguages, languages, and language varieties: Albanian, Bulgarian, Chechen, Czech, Damaskini, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbo-Croatian, Slovak, Slovene, Torlak, and Ukrainian. The other resources are available for a subset of these languages.
21 PAPERS • NO BENCHMARKS YET
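The compact MSD representation mentioned in the MULTEXT-East entry is positional: the first character names the category and each following character gives the value of one attribute. The sketch below illustrates that idea only. The attribute table is a hypothetical, heavily reduced subset for nouns, and the function name `decode_msd` is an assumption; the actual attribute order and value codes are language-specific and defined in the MULTEXT-East specifications.

```python
# Hypothetical, reduced attribute table for nouns only (an assumption for
# illustration; consult the MULTEXT-East specifications for the real tables).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "a": "accusative"}),
]

# Maps the first MSD character to a category name and its attribute table.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd: str) -> dict:
    """Expand a compact positional MSD string into attribute/value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unsupported MSD: {msd!r}")
    category, attributes = CATEGORIES[msd[0]]
    features = {"Category": category}
    # Each position after the first corresponds to one attribute, in order.
    for (attribute, values), code in zip(attributes, msd[1:]):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        # Unknown codes are kept as-is rather than guessed.
        features[attribute] = values.get(code, code)
    return features


decoded = decode_msd("Ncmsn")
```

Under these assumed tables, `decoded` holds the category `Noun` with type `common`, gender `masculine`, number `singular`, and case `nominative`.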
WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
19 PAPERS • 7 BENCHMARKS
This dataset contains orthographic samples of words in 19 languages (ar, br, de, en, eno, ent, eo, es, fi, fr, fro, it, ko, nl, pt, ru, sh, tr, zh). Each sample contains two text features: a Word (the textual representation of the word according to its orthography) and a Pronunciation (the highest-surface IPA pronunciation of the word as pronounced in its language).
1 PAPER • NO BENCHMARKS YET
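The two-feature samples of the orthographic dataset above could be modeled as follows. This is a minimal sketch: the class name `OrthographicSample`, the field names, the helper `make_sample`, and the example word with its IPA string are all hypothetical and not taken from the dataset itself; only the language codes come from the description.

```python
from dataclasses import dataclass

# The 19 language codes listed in the dataset description.
LANGUAGE_CODES = (
    "ar", "br", "de", "en", "eno", "ent", "eo", "es", "fi", "fr",
    "fro", "it", "ko", "nl", "pt", "ru", "sh", "tr", "zh",
)


@dataclass(frozen=True)
class OrthographicSample:
    """One sample: a word and its IPA pronunciation (hypothetical field names)."""
    word: str           # the word as written in its orthography
    pronunciation: str  # the surface IPA pronunciation


def make_sample(word: str, pronunciation: str) -> OrthographicSample:
    """Build a sample, rejecting empty features."""
    if not word or not pronunciation:
        raise ValueError("both word and pronunciation must be non-empty")
    return OrthographicSample(word=word, pronunciation=pronunciation)


# Hypothetical example entry, not drawn from the dataset.
sample = make_sample("hello", "həˈloʊ")
```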