12 dataset results for Token Classification AND English

CoNLL 2003

CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages, English and German. For each language there is a training file, a development file, a test file, and a large file of unannotated data.

635 PAPERS • 16 BENCHMARKS

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7, released in 2020, consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.
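UD treebanks are distributed in the tab-separated CoNLL-U format, where each token line carries ten fields, including the annotation layers listed above. The following is a minimal sketch of reading one such line. The example token line is invented for illustration and is not taken from any treebank, and `parse_conllu_token` is a hypothetical helper name.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_token(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # FEATS is "_" when empty, otherwise "Name=Value" pairs joined by "|".
    feats = token["feats"]
    token["feats"] = (
        {} if feats == "_"
        else dict(pair.split("=", 1) for pair in feats.split("|"))
    )
    return token


# Invented example line: the word "Dogs" as the subject of token 2.
line = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_conllu_token(line)
print(token["upos"], token["xpos"], token["feats"], token["head"], token["deprel"])
```

Comment lines (starting with `#`) and blank sentence separators in a real file would need to be skipped before calling this helper.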

505 PAPERS • 15 BENCHMARKS

WNUT 2017

WNUT 2017 (WNUT 2017 Emerging and Rare entity recognition)

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarisation), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find entity kktny hard to detect and resolve. This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.

111 PAPERS • 2 BENCHMARKS

CoNLL 2002

The shared task of CoNLL-2002 concerns language-independent named entity recognition. The types of named entities include: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task were offered training and test data for at least two languages. Information sources other than the training data might have been used in this shared task.

69 PAPERS • 3 BENCHMARKS

WikiANN

WikiANN (PAN-X)

WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
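To illustrate the IOB2 scheme mentioned above: every entity starts with a `B-` tag, continues with `I-` tags of the same type, and tokens outside any entity are tagged `O`. The sketch below groups such tags back into entity spans. The sentence is invented for illustration and is not drawn from WikiANN, and `iob2_to_spans` is a hypothetical helper name.

```python
def iob2_to_spans(tokens, tags):
    """Collect (entity_type, text) pairs from parallel token and IOB2 tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close whatever is open and skip this token.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Ada", "Lovelace", "worked", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tokens, tags))
# [('PER', 'Ada Lovelace'), ('LOC', 'London')]
```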

56 PAPERS • 7 BENCHMARKS

XGLUE

XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Compared to its concurrent work XTREME, XGLUE has two distinguishing characteristics. First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time. Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well: News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversity of languages, tasks and task origins provides a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan

19 PAPERS • 3 BENCHMARKS

Acronym Identification

An acronym disambiguation (AD) dataset for the scientific domain with 62,441 samples, significantly larger than the previous scientific AD dataset.

9 PAPERS • 1 BENCHMARK

WikiNEuRal

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

5 PAPERS • NO BENCHMARKS YET

BC4CHEMD

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles

4 PAPERS • 2 BENCHMARKS

FiNER-139

FiNER-139 comprises 1.1M sentences annotated with eXtensible Business Reporting Language (XBRL) tags extracted from annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities of a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of 139 entity types. Another important difference from typical entity extraction is that FiNER focuses on numeric tokens, with the correct tag depending mostly on context, not the token itself.

3 PAPERS • 1 BENCHMARK

WildReceipt

WildReceipt is a collection of receipts. For each photo, it contains a list of OCR annotations, each with a bounding box, text, and class.

3 PAPERS • 1 BENCHMARK

PLOD-filtered (PLOD: An Abbreviation Detection Dataset for Scientific Documents)

PLOD: An Abbreviation Detection Dataset

1 PAPER • 2 BENCHMARKS
