WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
58 PAPERS • 3 BENCHMARKS
We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.
14 PAPERS • 3 BENCHMARKS
Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.
4 PAPERS • NO BENCHMARKS YET