IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

In this paper, we introduce NLP resources for 11 major Indian languages from two language families. These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (the IndicGLUE benchmark). The monolingual corpora contain a total of 8.8 billion tokens across all 11 languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText and are hence well-suited to handling the morphological complexity of Indian languages. The pre-trained language models are based on the compact ALBERT model. Lastly, we compile the IndicGLUE benchmark for Indian-language NLU. To this end, we create datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple-Choice QA, Winograd NLI, and COPA. We also include publicly available datasets for some Indic languages for tasks such as Named Entity Recognition, Cross-lingual Sentence Retrieval, and Paraphrase Detection. Our embeddings are competitive with or better than existing pre-trained embeddings on multiple tasks. We hope that the availability of these resources will accelerate Indic NLP research, which has the potential to impact more than a billion people. They can also help the community evaluate advances in NLP over a more diverse pool of languages. The data and models are available at https://indicnlp.ai4bharat.org.
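To make the released artifacts concrete, the sketch below shows how the two kinds of pre-trained models described above might be loaded. It assumes the language model is published on the Hugging Face Hub under the ID ai4bharat/indic-bert (the ID used in the project's documentation) and that the FastText embeddings ship as per-language .bin files; the file name indicnlp.ft.hi.300.bin is a hypothetical example for Hindi, so adjust both to the actual release.

```python
# Minimal sketch: loading the pre-trained resources described in the abstract.
# Assumptions: the ALBERT-based model is on the Hugging Face Hub as
# "ai4bharat/indic-bert"; the FastText file name below is hypothetical.
import fasttext
from transformers import AutoModel, AutoTokenizer

# Pre-trained multilingual language model (compact ALBERT architecture).
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)

# FastText word embeddings: subword n-grams yield vectors even for unseen,
# morphologically inflected forms, which matters for Indian languages.
ft = fasttext.load_model("indicnlp.ft.hi.300.bin")  # hypothetical file name
vector = ft.get_word_vector("भारत")  # 300-dimensional vector
print(ft.get_nearest_neighbors("भारत", k=5))
```

Because FastText composes word vectors from character n-grams, get_word_vector returns a usable vector even for inflected forms absent from the training vocabulary, which is the property the abstract highlights for morphologically rich Indian languages.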


Datasets

Introduced in the paper: IndicCorp and IndicGLUE.
| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Sentiment Analysis | IITP Movie Reviews Sentiment | IndicBERT Base | Accuracy | 59.03 | #3 |
| Sentiment Analysis | IITP Product Reviews Sentiment | IndicBERT Base | Accuracy | 71.32 | #4 |
| Multiple-Choice Question Answering (MCQA) | IndicGLUE WSTP (Punjabi) | IndicBERT Large | Accuracy | 77.54 | #2 |
| News Classification | Soham News Article Classification | IndicBERT Base | Accuracy | 78.45 | #3 |
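Each row above corresponds to fine-tuning a pre-trained IndicBERT checkpoint on the task's training split and reporting test-set accuracy. Below is a minimal sketch of that setup for one of the sentence-classification tasks, assuming the ai4bharat/indic-bert model ID and the Hugging Face Trainer API; load_iitp_sentiment() is a hypothetical placeholder for the task's data loader, and the hyperparameters are illustrative rather than the paper's.

```python
# Minimal fine-tune-and-evaluate sketch for a sentence-classification row in
# the table (e.g. IITP Movie Reviews Sentiment). Assumptions: the model ID
# "ai4bharat/indic-bert"; load_iitp_sentiment() is a hypothetical placeholder
# returning Hugging Face `datasets.Dataset` splits with "text"/"label" columns.
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=3)  # e.g. positive/neutral/negative

train_ds, test_ds = load_iitp_sentiment()  # hypothetical data loader

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, -1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indic-bert-iitp",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # {"eval_accuracy": ...} is what the table reports
```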
