6 dataset results for Named Entity Recognition (NER) AND Hindi

WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.

58 PAPERS • 3 BENCHMARKS

IndicGLUE

IndicGLUE (Indic General Language Understanding Evaluation Benchmark)

We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.

14 PAPERS • 3 BENCHMARKS

Naamapadam

Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.

3 PAPERS • NO BENCHMARKS YET

HiNER-collapsed

HiNER-collapsed (HiNER: A Large Hindi Named Entity Recognition Dataset)

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed tags (PER, LOC, ORG).

1 PAPER • 1 BENCHMARK

HiNER-original

HiNER-original (HiNER: A Large Hindi Named Entity Recognition Dataset)

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.

1 PAPER • 1 BENCHMARK

IECSIL FIRE-2018 Shared Task

The dataset is taken from the First shared task on Information Extractor for Conversational Systems in Indian Languages (IECSIL) . It consists of 15,48,570 Hindi words in Devanagari script and corresponding NER labels. Each sentence end is marked by \newline" tag. Fig. 1 shows a snapshot of one sentence in the dataset. Our Dataset has nine classes, namely, Datenum, Event, Location, Name, Number, Occupation, Organization, Other, Things.

1 PAPER • 1 BENCHMARK

Datasets

6 dataset results for Named Entity Recognition (NER) AND Hindi