This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed with the open-source CC-Net repository, using the URLs and paragraph indices it provides to process the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline.
99 PAPERS • NO BENCHMARKS YET
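The file layout described for the CC-Net-derived corpus above (documents separated by double newlines, paragraphs within a document separated by single newlines) can be read with a few lines of Python. This is a minimal sketch that assumes the text is already decoded into a string; the function name `parse_documents` and the sample text are hypothetical and not part of the corpus tooling.

```python
from typing import List


def parse_documents(text: str) -> List[List[str]]:
    """Split corpus text into documents, each a list of paragraphs.

    Assumes the layout described above: a blank line (double newline)
    between documents, a single newline between paragraphs.
    """
    documents = []
    for block in text.split("\n\n"):
        # Drop empty lines left over from leading or trailing newlines.
        paragraphs = [line for line in block.split("\n") if line.strip()]
        if paragraphs:
            documents.append(paragraphs)
    return documents


# Hypothetical sample: two documents, the first with two paragraphs.
sample = (
    "First paragraph of document one.\n"
    "Second paragraph of document one.\n"
    "\n"
    "Only paragraph of document two.\n"
)
docs = parse_documents(sample)
print(len(docs), [len(d) for d in docs])
```

For files too large to hold in memory, the same idea applies line by line: accumulate paragraphs until an empty line is seen, then emit the document.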
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
56 PAPERS • NO BENCHMARKS YET
IndoSum is a benchmark dataset for Indonesian text summarization. It consists of news articles and manually constructed summaries.
9 PAPERS • NO BENCHMARKS YET
A large-scale Indonesian summarization dataset consisting of harvested articles from Liputan6.com, an online news portal, resulting in 215,827 document-summary pairs.
5 PAPERS • NO BENCHMARKS YET