WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. The dataset is a resource for training and evaluating named entity recognition models across many languages.
58 PAPERS • 3 BENCHMARKS
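To illustrate the IOB2 format mentioned in the WikiANN description, here is a minimal sketch that groups a tag sequence into entity spans. The function name `iob2_to_spans` and the sample sentence are illustrative assumptions and are not part of the dataset or any library. In IOB2, `B-X` opens an entity of type X, `I-X` continues it, and `O` marks tokens outside any entity.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (entity_type, start, end_exclusive) spans.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, int, int]] = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type continues the open entity.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new entity. A stray I- with no matching open entity
        # is treated leniently as an opening tag (an assumption of this sketch).
        if tag.startswith("B-") or tag.startswith("I-"):
            start, current = i, tag[2:]
    # Close an entity that runs to the end of the sentence.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Made-up example sentence, not taken from WikiANN.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
for entity_type, begin, end in iob2_to_spans(tags):
    print(entity_type, " ".join(tokens[begin:end]))
```

Spans use an exclusive end index so that `tokens[begin:end]` recovers the entity text directly.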
IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, is a collection of various NLP tasks. The goal is to provide an evaluation benchmark for the natural language understanding capabilities of NLP models on diverse tasks and multiple Indian languages.
14 PAPERS • 3 BENCHMARKS
Naamapadam is a Named Entity Recognition (NER) dataset for 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization). The training dataset was created automatically from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence onto the corresponding Indian language sentence.
3 PAPERS • NO BENCHMARKS YET
A large, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed tags (PER, LOC, ORG).
1 PAPER • 1 BENCHMARK
A large, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The dataset is taken from the First shared task on Information Extractor for Conversational Systems in Indian Languages (IECSIL). It consists of 15,48,570 (1,548,570) Hindi words in Devanagari script with corresponding NER labels. Each sentence end is marked by a "newline" tag. Fig. 1 shows a snapshot of one sentence in the dataset. The dataset has nine classes: Datenum, Event, Location, Name, Number, Occupation, Organization, Other, and Things.
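As a rough sketch of how the sentence-end marker described above could be used to split a token stream into sentences, the snippet below assumes one tab-separated `token<TAB>label` pair per line and a marker line whose token is literally `newline`. Both the file layout and the helper name `read_sentences` are assumptions made for illustration; the actual IECSIL file format may differ.

```python
from typing import Iterable, List, Tuple

# Assumed spelling of the sentence-end marker; adjust to the real data.
SENTENCE_END = "newline"


def read_sentences(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Split token/label lines into sentences at the end-of-sentence marker.

    Hypothetical reader: assumes one "token<TAB>label" pair per line.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if parts[0] == SENTENCE_END:
            # The marker closes the current sentence and is not kept as a token.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        current.append((parts[0], parts[1]))
    # Keep a trailing sentence that has no closing marker.
    if current:
        sentences.append(current)
    return sentences


# Made-up sample lines, not taken from the dataset.
sample = [
    "दिल्ली\tLocation",
    "में\tOther",
    "newline\tnewline",
    "राम\tName",
    "newline\tnewline",
]
for sentence in read_sentences(sample):
    print(sentence)
```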