This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.
110 PAPERS • NO BENCHMARKS YET
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
67 PAPERS • 3 BENCHMARKS
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
64 PAPERS • NO BENCHMARKS YET
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
58 PAPERS • NO BENCHMARKS YET
UzWordnet is a lexical-semantic database, or a “word-net”, for the (Northern) Uzbek language (native: O’zbek till) compatible with Princeton Wordnet. By providing it open source (see License), we aim to motivate, support, and increase the application of database and knowledge graphs principles and techniques to the study of computational aspects of the (Northern) Uzbek language and, more generally, the usability of Uzbek within IT applications and the Internet.
3 PAPERS • NO BENCHMARKS YET
Dataset Summary INCLUDE is a comprehensive knowledge- and reasoning-centric benchmark across 44 languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed. It contains 22,637 4-option multiple-choice-questions (MCQ) extracted from academic and professional exams, covering 57 topics, including regional knowledge.
1 PAPER • NO BENCHMARKS YET
The Uzbek speech corpus (USC) comprises 958 different speakers with a total of 105 hours of transcribed audio recordings. This is the first open-source Uzbek speech corpus dedicated to the ASR task.