🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language (clear)

41 dataset results for Named Entity Recognition (NER) AND Texts AND English

CoNLL 2003

CoNLL-2003 is a named entity recognition dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each of the languages there is a training file, a development file, a test file and a large file with unannotated data.

639 PAPERS • 16 BENCHMARKS

OntoNotes 5.0

OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

237 PAPERS • 11 BENCHMARKS

BC5CDR (BioCreative V CDR corpus)

BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.

174 PAPERS • 6 BENCHMARKS

BLUE

BLUE (Biomedical Language Understanding Evaluation)

The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.

123 PAPERS • NO BENCHMARKS YET

SciERC

SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets in scientific articles SemEval 2017 Task 10 and SemEval 2018 Task 7 by extending entity types, relation types, relation coverage, and adding cross-sentence relations using coreference links.

120 PAPERS • 7 BENCHMARKS

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.

116 PAPERS • 6 BENCHMARKS

WNUT 2017

WNUT 2017 (WNUT 2017 Emerging and Rare entity recognition)

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarisation), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find entity kktny hard to detect and resolve. This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.

114 PAPERS • 1 BENCHMARK

Few-NERD

Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).

72 PAPERS • 3 BENCHMARKS

CoNLL 2002

The shared task of CoNLL-2002 concerns language-independent named entity recognition. The types of named entities include: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task were offered training and test data for at least two languages. Information sources other than the training data might have been used in this shared task.

70 PAPERS • 3 BENCHMARKS

ACE 2005 (ACE 2005 Multilingual Training Corpus)

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

62 PAPERS • 9 BENCHMARKS

ACE 2004

ACE 2004 (ACE 2004 Multilingual Training Corpus)

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

46 PAPERS • 5 BENCHMARKS

IPM NEL

IPM NEL (Derczynski IPM Named Entity Linking)

This data is for the task of named entity recognition and linking/disambiguation over tweets. It comprises the addition of an entity URI layer on top of an NER-annotated tweet dataset. The task is to detect entities and then provide a correct link to them in DBpedia, thus disambiguating otherwise ambiguous entity surface forms; for example, this means linking "Paris" to the correct instance of a city named that (e.g. Paris, France vs. Paris, Texas).

31 PAPERS • 1 BENCHMARK

DWIE

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

17 PAPERS • 4 BENCHMARKS

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of600 PubMed abstracts. Furthermore, BioRED label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.

14 PAPERS • 3 BENCHMARKS

CrossNER

CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

12 PAPERS • 1 BENCHMARK

Broad Twitter Corpus

This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire.

11 PAPERS • 2 BENCHMARKS

BB (Bacteria Biotope)

The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality and data formats. Manually annotated data is provided for training, development and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available. Support in performing linguistic processing are provided in the form of analyses created by various state-of-the art tools on the dataset texts.

8 PAPERS • 2 BENCHMARKS

CoNLL-2000

CoNLL-2000 is a dataset for dividing text into syntactically related non-overlapping groups of words, so-called text chunking.

8 PAPERS • 2 BENCHMARKS

GUM (Georgetown University Multilayer corpus)

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

8 PAPERS • 1 BENCHMARK

GeoWebNews

GeoWebNews provides test/train examples and enable fine-grained Geotagging and Toponym Resolution (Geocoding). This dataset is also suitable for prototyping and evaluating machine learning NLP models.

7 PAPERS • NO BENCHMARKS YET

MEDIA

The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts : (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences, and (3) the test set (200 dialogues, 3K sentences).

6 PAPERS • NO BENCHMARKS YET

AMALGUM

AMALGUM (A Machine Annotated Lookalike of GUM)

AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.

5 PAPERS • NO BENCHMARKS YET

WikiNEuRal

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

5 PAPERS • NO BENCHMARKS YET

BC4CHEMD

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles

4 PAPERS • 1 BENCHMARK

Naamapadam

Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.

4 PAPERS • NO BENCHMARKS YET

i2b2 De-identification Dataset

i2b2 De-identification Dataset (Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset)

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

4 PAPERS • 1 BENCHMARK

CORD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER on scanned VrDs.

3 PAPERS • 1 BENCHMARK

FUNSD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER on scanned VrDs.

3 PAPERS • 1 BENCHMARK

FindVehicle

The first NER dataset in the field of traffic, which is to extract the characteristics and attributes of the vehicle on the road.

3 PAPERS • 1 BENCHMARK

ShARe/CLEF 2014: Task 2 Disorders

3 PAPERS • 2 BENCHMARKS

BC7 NLM-Chem

BC7 NLM-Chem (BioCreative VII NLM-Chem)

Full-text chemical identification and indexing in PubMed articles.

2 PAPERS • 3 BENCHMARKS

Biographical

Biographical (Biographical: A Semi-Supervised Relation Extraction Dataset)

Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.

2 PAPERS • NO BENCHMARKS YET

Rare Diseases Mentions in MIMIC-III

Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes)

Data annotation The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv.

2 PAPERS • 1 BENCHMARK

THYME-2016

2 PAPERS • 1 BENCHMARK

BUSTER

BUSTER (BUSiness Transaction Entity Recognition dataset.)

BUSiness Transaction Entity Recognition dataset.

1 PAPER • NO BENCHMARKS YET

Chem-FINESE

The dataset contains two few-shot chemical fine-grained entity extraction datasets, based on human-annotated ChemNER+ and CHEMET. For each dataset, we randomly sample a subset based on the frequency of each type class. Specifically, given a dataset, we first set the number of maximum entity mentions $k$ for the most frequent entity type in the dataset. We then randomly sample other types and ensure that the distribution of each type remains the same as in the original dataset. We choose the values $6, 9, 12, 15, 18$ as the potential maximum entity mentions for $k$. The ChemNER+ and CHEMET few-shot datasets contain 52 and 28 types respectively.

1 PAPER • NO BENCHMARKS YET

Financial Language Understanding Evaluation

Financial Language Understanding Evaluation is an open-source comprehensive suite of benchmarks for the financial domain. It contains benchmarks across 5 NLP tasks in financial domain as well as common benchmarks used in the previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection and question answering.

1 PAPER • NO BENCHMARKS YET

LPSC

LPSC (Planetary Science Data Set)

This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .

1 PAPER • 2 BENCHMARKS

TASTEset

TASTEset Recipe Dataset and Food Entities Recognition is a dataset for Named Entity Recognition (NER) which consists of 700 recipes with more than 13,000 entities to extract.

1 PAPER • NO BENCHMARKS YET

UNER v1

UNER v1 (Universal NER v1)

UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 geneologically and ty- pologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese4. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.

1 PAPER • 31 BENCHMARKS

STEM-ECR

Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources The STEM ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises annotations for scientific entities in scientific Abstracts drawn from 10 disciplines in Science, Technology, Engineering, and Medicine. The annotated entities are further grounded to Wikipedia and Wiktionary, respectively.

0 PAPER • NO BENCHMARKS YET

Datasets

41 dataset results for Named Entity Recognition (NER) AND Texts AND English