CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages, English and German. For each language there is a training file, a development file, a test file, and a large file of unannotated data.
747 PAPERS • 19 BENCHMARKS
The CoNLL dataset is a widely used resource in natural language processing (NLP). “CoNLL” stands for the Conference on Computational Natural Language Learning, and the dataset originates from the series of shared tasks organized at that conference.
184 PAPERS • 47 BENCHMARKS
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. It is used for training and evaluating named entity recognition models across many languages.
66 PAPERS • 3 BENCHMARKS
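The IOB2 format mentioned in the WikiANN description marks the first token of each entity with `B-<TYPE>`, any continuation tokens with `I-<TYPE>`, and tokens outside any entity with `O`. The following is a minimal sketch of how such tags can be decoded into entity spans. The function name `iob2_to_spans` and the example sentence are hypothetical and are not taken from the dataset or from any library; treating a stray `I-` tag as the start of a new span is a lenient assumption, since strict IOB2 does not allow it.

```python
from typing import List, Tuple

# (label, start index, end index exclusive, surface text)
Span = Tuple[str, int, int, str]


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Decode IOB2 tags into entity spans (illustrative helper, not a library API)."""
    spans: List[Span] = []
    start = None
    label = None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            # The open entity ends just before position i.
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Assumption: be lenient and open a span on a stray I- tag.
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence using the three WikiANN tag types.
tokens = ["Angela", "Merkel", "visited", "Paris", "with", "United", "Nations"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
spans = iob2_to_spans(tokens, tags)
for span in spans:
    print(span)
```

Because every entity opens with a `B-` tag, two adjacent entities of the same type stay distinguishable, which is the main practical difference between IOB2 and the older IOB1 scheme.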
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark was introduced to encourage more research on multilingual transfer learning. XTREME covers 40 typologically diverse languages spanning 12 language families and includes 9 tasks that require reasoning about different levels of syntax or semantics.
52 PAPERS • 2 BENCHMARKS
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
6 PAPERS • NO BENCHMARKS YET
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian legal text.
3 PAPERS • 2 BENCHMARKS
Dataset Description
1 PAPER • NO BENCHMARKS YET
The MIM-GOLD-NER dataset is an Icelandic named entity (NE) corpus. It is a version of the MIM-GOLD corpus that has been tagged for named entities: over 48,000 NEs are labeled within a corpus of one million tokens. The dataset can be used to train named entity recognizers for Icelandic.
0 PAPER • 1 BENCHMARK
The Tagalog Universal Dependencies NewsCrawl dataset consists of annotated text extracted from the Leipzig Tagalog Corpus. Data included in the Leipzig Tagalog Corpus were crawled from Tagalog-language online news sites by the Leipzig University Institute for Computer Science.
0 PAPER • NO BENCHMARKS YET