This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.
110 PAPERS • NO BENCHMARKS YET
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
69 PAPERS • 3 BENCHMARKS
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
60 PAPERS • NO BENCHMARKS YET
MasakhaNER is a collection of Named Entity Recognition (NER) datasets for 10 different African languages. The languages forming this dataset are: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá.
54 PAPERS • 2 BENCHMARKS
MasakhaNEWS is a benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
10 PAPERS • NO BENCHMARKS YET
HERDPhobia is an annotated hate speech detection dataset on Fulani herders in Nigeria -- in three languages: English, Nigerian-Pidgin, and Hausa.
2 PAPERS • NO BENCHMARKS YET
PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.
1 PAPER • NO BENCHMARKS YET
PolyNews is a multilingual parallel dataset containing news titles 833 language pairs, spanning in 64 languages and 17 scripts.