IndicCorp Dataset | Papers With Code

IndicCorp is a large monolingual corpus of around 9 billion tokens covering 12 of the major Indian languages. It was built by discovering and scraping thousands of web sources, primarily news, magazines, and books, over several months.
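As a rough sanity check on the "around 9 billion tokens" figure, the per-language token counts listed in the Downloads table can be summed. The snippet below is only a sketch: the numbers are copied from that table (in millions), and the names `TOKENS_MILLIONS` and `total_tokens_billions` are illustrative, not part of any IndicCorp tooling.

```python
# Token counts per language, in millions, copied from the Downloads table.
TOKENS_MILLIONS = {
    "as": 32.6, "bn": 836.0, "en": 1220.0, "gu": 719.0,
    "hi": 1860.0, "kn": 713.0, "ml": 721.0, "mr": 551.0,
    "or": 107.0, "pa": 773.0, "ta": 582.0, "te": 674.0,
}


def total_tokens_billions(counts_millions: dict) -> float:
    """Sum per-language counts (in millions) and return the total in billions."""
    return sum(counts_millions.values()) / 1000.0


if __name__ == "__main__":
    total = total_tokens_billions(TOKENS_MILLIONS)
    print(f"{len(TOKENS_MILLIONS)} languages, about {total:.2f}B tokens")
```

The rounded table entries add up to roughly 8.8 billion tokens, which is consistent with the "around 9 billion" stated above.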

**Languages covered**: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

**Corpus Format**: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
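Because the format is one sentence per line, the corpus can be streamed without loading it into memory. The following is a minimal sketch under stated assumptions: the file is UTF-8 plain text, the helper names are hypothetical, and whitespace splitting is only a crude stand-in for tokenization (the released text is untokenized, so counts obtained this way need not match the published token figures).

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_sentences(path: Path) -> Iterator[str]:
    """Yield non-empty sentences from a one-sentence-per-line UTF-8 file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence


def count_sentences_and_tokens(path: Path) -> Tuple[int, int]:
    """Return (sentence count, whitespace-token count) for the file."""
    sentences = 0
    tokens = 0
    for sentence in iter_sentences(path):
        sentences += 1
        # Whitespace split: an approximation, not a real tokenizer.
        tokens += len(sentence.split())
    return sentences, tokens
```

Calling `count_sentences_and_tokens(Path("hi.txt"))` would process the file line by line; `hi.txt` is a placeholder, since the name of the extracted file is not specified here.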

**Downloads**

| Language | \# News Articles* | Sentences     | Tokens        | Link     |
| -------- | ----------------- | ------------- | ------------- | -------- |
| as       | 0.60M             | 1.39M   |  32.6M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/as.tar.xz) |
| bn       | 3.83M             | 39.9M | 836M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/bn.tar.xz) |
| en       | 3.49M             | 54.3M | 1.22B | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/en.tar.xz) |
| gu       | 2.63M             | 41.1M | 719M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/gu.tar.xz) |
| hi       | 4.95M             | 63.1M |  1.86B | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/hi.tar.xz) |
| kn       | 3.76M             | 53.3M | 713M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/kn.tar.xz) |
| ml       | 4.75M             | 50.2M |  721M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/ml.tar.xz) |
| mr       | 2.31M             | 34.0M | 551M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/mr.tar.xz) |
| or       | 0.69M             | 6.94M   | 107M   | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/or.tar.xz) |
| pa       | 2.64M             | 29.2M |  773M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/pa.tar.xz) |
| ta       | 4.41M             |  31.5M   |  582M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/ta.tar.xz) |
| te       | 3.98M             | 47.9M   |  674M  | [link](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/te.tar.xz) |

\* Excluding articles obtained from the OSCAR corpus
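The download links share a common pattern, so fetching an archive can be scripted. The sketch below assumes that every language follows the `<code>.tar.xz` naming used in the table; the function names are hypothetical, and the layout of files inside each archive is not documented here, so the code simply reports whatever member names it finds.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

BASE_URL = "https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp"
LANGUAGES = ("as", "bn", "en", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te")


def archive_url(lang: str) -> str:
    """Build the download URL for a language code listed in the table."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{BASE_URL}/{lang}.tar.xz"


def download_and_extract(lang: str, dest: Path) -> List[str]:
    """Download one archive into dest, unpack it, and return its member names."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{lang}.tar.xz"
    urllib.request.urlretrieve(archive_url(lang), str(archive))
    # tarfile handles xz compression natively via the "r:xz" mode.
    with tarfile.open(str(archive), mode="r:xz") as tar:
        names = tar.getnames()
        tar.extractall(str(dest))
    return names
```

For example, `download_and_extract("as", Path("indiccorp"))` would fetch the Assamese archive, the smallest in the table. The larger languages contain tens of millions of sentences, so check available disk space before downloading them.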

**Similar Datasets**: Sangraha, Naamapadam, IndicGLUE, PMIndia
