This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed with the open-source CC-Net repository, using the URLs and paragraph indices it provides, by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline.
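The file layout described above (blank line between documents, one paragraph per line) can be read with a small streaming parser. The following is a minimal sketch, not code from the CC-Net repository; the function name `iter_documents` and the sample text are hypothetical and used only for illustration.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of paragraph strings.

    Assumes the layout described above: one paragraph per line,
    and an empty line between consecutive documents.
    """
    paragraphs: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # An empty line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # Last document when the file does not end with a blank line.
        yield paragraphs


# Hand-written sample text (hypothetical, not taken from the corpus).
sample = (
    "First paragraph of document one.\n"
    "Second paragraph of document one.\n"
    "\n"
    "Only paragraph of document two.\n"
)
docs = list(iter_documents(sample.splitlines()))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading a large file into memory at once.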
110 PAPERS • NO BENCHMARKS YET
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
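In the IOB2 scheme, the first token of every entity is tagged `B-<label>`, the following tokens of the same entity are tagged `I-<label>`, and tokens outside any entity are tagged `O`. The sketch below shows one way to turn such a tag sequence into entity spans. The function name `iob2_to_spans` and the example sentence are hypothetical and hand-written, not taken from the dataset; ignoring a stray `I-` tag that has no preceding `B-` is an assumption made here for leniency.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags into (label, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no open span is ignored (assumption).
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hand-written example sentence (hypothetical, not from WikiANN).
tokens = ["Barack", "Obama", "visited", "Paris", "with", "UNESCO", "staff"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "O"]
spans = iob2_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
```

Returning token offsets rather than strings keeps the function independent of how tokens are later joined, which matters for languages written without spaces.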
66 PAPERS • 3 BENCHMARKS
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
58 PAPERS • NO BENCHMARKS YET
MasakhaNEWS is a benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
10 PAPERS • NO BENCHMARKS YET
The GATITOS (Google's Additional Translations Into Tail-languages: Often Short) dataset is a high-quality, multi-way parallel dataset of tokens and short phrases, intended for training and improving machine translation models. The dataset consists of 4,000 English segments (4,500 tokens) that have been translated into each of 26 low-resource languages, as well as three higher-resource pivot languages (es, fr, hi). All translations were made directly from English, with the exception of Aymara, which was translated from Spanish.
2 PAPERS • NO BENCHMARKS YET
PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.
1 PAPER • NO BENCHMARKS YET
Sagalee is a speech recognition dataset for the Oromo language. Key features: 100 hours of read speech; 283 gender-balanced speakers; coverage of different Oromo dialects; open source for research.
1 PAPER • 1 BENCHMARK