IndicCorp is a large collection of monolingual corpora with around 9 billion tokens, covering 12 of the major Indian languages. It was built over several months by discovering and scraping thousands of web sources, primarily news, magazines and books.
Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu
Corpus Format: The corpus for each language is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
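Because the format is one sentence per line, a downloaded file can be streamed rather than loaded into memory. The following is a minimal sketch, assuming the file is UTF-8 plain text; the function names and the file path `hi.txt` in the usage note are hypothetical, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the token counts reported for the corpus.

```python
from typing import Iterator, Tuple


def iter_sentences(path: str) -> Iterator[str]:
    """Yield non-empty sentences from a one-sentence-per-line text file."""
    # Assumption: the file is UTF-8 encoded plain text.
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (sentence count, whitespace-token count) for the file."""
    n_sentences = 0
    n_tokens = 0
    for sentence in iter_sentences(path):
        n_sentences += 1
        # The released text is untokenized, so this is only an approximation.
        n_tokens += len(sentence.split())
    return n_sentences, n_tokens
```

For example, `corpus_stats("hi.txt")` would return the sentence and approximate token counts for a locally downloaded Hindi file. Since the released text is shuffled at the sentence level, document boundaries cannot be recovered from line order.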
Downloads
Language | # News Articles* | Sentences | Tokens | Link |
---|---|---|---|---|
as | 0.60M | 1.39M | 32.6M | link |
bn | 3.83M | 39.9M | 836M | link |
en | 3.49M | 54.3M | 1.22B | link |
gu | 2.63M | 41.1M | 719M | link |
hi | 4.95M | 63.1M | 1.86B | link |
kn | 3.76M | 53.3M | 713M | link |
ml | 4.75M | 50.2M | 721M | link |
mr | 2.31M | 34.0M | 551M | link |
or | 0.69M | 6.94M | 107M | link |
pa | 2.64M | 29.2M | 773M | link |
ta | 4.41M | 31.5M | 582M | link |
te | 3.98M | 47.9M | 674M | link |
* Excluding articles obtained from the OSCAR corpus
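As a quick consistency check, the per-language token counts in the table can be summed and compared with the "around 9 billion tokens" figure. This sketch simply copies the table values, expressed in millions (so 1.22B becomes 1220); the variable names are illustrative.

```python
# Token counts from the Downloads table, in millions of tokens.
TOKENS_MILLIONS = {
    "as": 32.6, "bn": 836, "en": 1220, "gu": 719,
    "hi": 1860, "kn": 713, "ml": 721, "mr": 551,
    "or": 107, "pa": 773, "ta": 582, "te": 674,
}

total_millions = sum(TOKENS_MILLIONS.values())
total_billions = total_millions / 1000

print(f"Total: {total_billions:.2f}B tokens across {len(TOKENS_MILLIONS)} languages")
```

The sum comes to roughly 8.79 billion tokens, in line with the rounded overall figure.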