This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.
48 PAPERS • NO BENCHMARKS YET
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
30 PAPERS • NO BENCHMARKS YET
WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
19 PAPERS • 7 BENCHMARKS
Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus has 49.6M sentence pairs between English to Indian Languages.
17 PAPERS • NO BENCHMARKS YET
IndicCorp is a large monolingual corpora with around 9 billion tokens covering 12 of the major Indian languages. It has been developed by discovering and scraping thousands of web sources - primarily news, magazines and books, over a duration of several months.
12 PAPERS • NO BENCHMARKS YET
We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.
9 PAPERS • 3 BENCHMARKS
The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon which consists of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.
8 PAPERS • NO BENCHMARKS YET
The Kannada-MNIST dataset is a drop-in substitute for the standard MNIST dataset for the Kannada language.
6 PAPERS • NO BENCHMARKS YET
This dataset contains speech recordings along with speaker physical parameters (height, weight, shoulder size, age ) as well as regional information and linguistic information.
3 PAPERS • NO BENCHMARKS YET
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
2 PAPERS • NO BENCHMARKS YET
KanHope is a code mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The dataset consists of 6,176 user-generated comments in code mixed Kannada crawled from YouTube and manually labelled as bearing hope speech or not-hope speech.
2 PAPERS • 1 BENCHMARK
IRLCov19 is a multilingual Twitter dataset related to Covid-19 collected in the period between February 2020 to July 2020 specifically for regional languages in India. It contains more than 13 million tweets.
1 PAPER • NO BENCHMARKS YET
A special corpus of Indian languages covering 13 major languages of India. It comprises of 10000+ spoken sentences/utterances each of mono and English recorded by both Male and Female native speakers. Speech waveform files are available in .wav format along with the corresponding text. We hope that these recordings will be useful for researchers and speech technologists working on synthesis and recognition. You can request zip archives of the entire database here.
1 PAPER • 1 BENCHMARK
We announce the release of a new multilingual speaker dataset called NITK-IISc Multilingual Multi-accent Speaker Profiling(NISP) dataset. The dataset contains speech in six different languages -- five Indian languages along with Indian English. The dataset contains speech data from 345 bilingual speakers in India. Each speaker has contributed about 4-5 minutes of data that includes recordings in both English and their mother tongue. The transcript for the text is provided in UTF-8 format. For every speaker, the dataset contains speaker meta-data such as L1, native place, medium of instruction, current residing place etc. In addition the dataset also contains physical parameter information of the speakers such as age, height, shoulder size and weight. We hope that the dataset is useful for a diverse set of research activities including multilingual speaker recognition, language and accent recognition, automatic speech recognition etc.
1 PAPER • NO BENCHMARKS YET