The dataset comprises two few-shot chemical fine-grained entity extraction datasets, based on the human-annotated ChemNER+ and CHEMET. For each dataset, we randomly sample a subset based on the frequency of each type class. Specifically, given a dataset, we first set the maximum number of entity mentions $k$ for the most frequent entity type in the dataset. We then randomly sample the other types, ensuring that the distribution over types remains the same as in the original dataset. We choose $6, 9, 12, 15, 18$ as the candidate values of $k$. The ChemNER+ and CHEMET few-shot datasets contain 52 and 28 types, respectively.
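The sampling procedure above can be sketched as follows. This is an illustrative sketch only, not the authors' released code: the example format (`{"types": [...]}`), the proportional rounding, and the greedy fill are assumptions.

```python
import random
from collections import Counter

def sample_few_shot(examples, k, seed=0):
    """Sample a few-shot subset whose per-type mention counts follow the
    original type distribution, with the most frequent type capped at k
    mentions. Each example is a dict with a "types" list of entity-type
    labels for its mentions (hypothetical format)."""
    rng = random.Random(seed)
    # Count mentions per type over the full dataset.
    freq = Counter(t for ex in examples for t in ex["types"])
    max_freq = max(freq.values())
    # Per-type mention budget, proportional to the original distribution.
    budget = {t: max(1, round(k * c / max_freq)) for t, c in freq.items()}
    picked, seen = [], Counter()
    # Greedily keep shuffled examples while any of their types is under budget.
    for ex in rng.sample(examples, len(examples)):
        if any(seen[t] < budget[t] for t in ex["types"]):
            picked.append(ex)
            seen.update(ex["types"])
    return picked
```

With $k = 6$ and a dataset whose most frequent type has twice the mentions of another, the rarer type's budget works out to 3, preserving the 2:1 ratio.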
1 PAPER • NO BENCHMARKS YET
BUSiness Transaction Entity Recognition dataset.
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
1 PAPER • 31 BENCHMARKS
The dataset contains a total of 253,070 records with 18 features. The features are categorized into four types: Metadata, Primary Data, Engagement Stats, and Label. The Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains the title and description of the video; the "Processed" columns hold the cleaned data after denoising, deduplication, and debiasing for further analysis. The Engagement Stats category contains user engagement metrics for each video. The Label category contains predefined auto labels, human-annotated labels, and AI-generated pseudo labels. Auto labels are derived automatically from a review of channels' titles, descriptions, and thumbnails over time: channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on
ViMQ is a Vietnamese dataset of medical questions from patients, with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the online consultation section between patients and doctors on www.vinmec.com, the website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed response provided by a clinical expert. The dataset covers health issues in a wide range of categories, including common illnesses, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask for information about the hospital and selected 9,000 questions for the dataset.
3 PAPERS • NO BENCHMARKS YET
Naamapadam is a Named Entity Recognition (NER) dataset for 11 major Indian languages from two language families. It contains more than 400k sentences per language, annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian-language sentence.
legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
6 PAPERS • NO BENCHMARKS YET
Financial Language Understanding Evaluation is an open-source, comprehensive suite of benchmarks for the financial domain. It contains benchmarks across 5 NLP tasks in the financial domain, as well as common benchmarks used in previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection, and question answering.
PcMSP is a dataset annotated from 305 open access scientific articles for material science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
2 PAPERS • NO BENCHMARKS YET
DR.BENCH is a dataset for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
The first NER dataset in the traffic domain, designed to extract the characteristics and attributes of vehicles on the road.
3 PAPERS • 1 BENCHMARK
MultiCoNER is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.
42 PAPERS • NO BENCHMARKS YET
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
The CareerCoach 2022 gold standard is available for download in the NIF and JSON formats, and draws upon documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.
This release provides a large, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed tags (PER, LOC, ORG).
1 PAPER • 1 BENCHMARK
This release provides a large, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
TASTEset (Recipe Dataset and Food Entities Recognition) is a dataset for Named Entity Recognition (NER) consisting of 700 recipes with more than 13,000 entities to extract.
BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease, chemical–chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, BioRED labels each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
14 PAPERS • 3 BENCHMARKS
KIND is an Italian dataset for Named-Entity Recognition. It contains more than one million tokens with the annotation covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses.
KazNERD is a dataset for Kazakh named entity recognition. The dataset was built to address the clear need for publicly available annotated corpora in Kazakh, along with annotation guidelines containing straightforward yet rigorous rules and examples. The annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes.
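In the IOB2 scheme mentioned above, the first token of every entity is tagged B- and any continuation tokens I-; all other tokens are O. A minimal illustration (the sentence, the GPE tag name, and the helper function are invented for this sketch, not taken from KazNERD's 25-class tagset):

```python
# IOB2: B- opens an entity span, I- continues it, O marks tokens outside any entity.
tokens = ["Astana", "is", "the", "capital", "of", "Kazakhstan"]
tags   = ["B-GPE", "O", "O", "O", "O", "B-GPE"]

def extract_spans(tokens, tags):
    """Collect (entity_type, surface_text) spans from an IOB2-tagged sequence."""
    spans, cur = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                spans.append(cur)
            cur = (tag[2:], [tok])          # start a new span
        elif tag.startswith("I-") and cur and tag[2:] == cur[0]:
            cur[1].append(tok)              # continue the current span
        else:
            if cur:
                spans.append(cur)
            cur = None                      # O tag (or stray I-) closes the span
    if cur:
        spans.append(cur)
    return [(t, " ".join(ws)) for t, ws in spans]
```

Because every entity must open with B-, two adjacent entities of the same type remain distinguishable, which is the main advantage of IOB2 over plain IO tagging.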
An open, broad-coverage corpus for informal Persian named entity recognition was collected from Twitter.
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
5 PAPERS • NO BENCHMARKS YET
MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.
RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.
38 PAPERS • NO BENCHMARKS YET
Chinese Medical Named Entity Recognition, a dataset first released in CHIP2020, is used for the CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from a given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.
8 PAPERS • 1 BENCHMARK
DiaKG is a high-quality Chinese dataset for Diabetes knowledge graph.
DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.
4 PAPERS • NO BENCHMARKS YET
Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.
19 PAPERS • 1 BENCHMARK
Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
72 PAPERS • 3 BENCHMARKS
The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are provided in full_set_RD_ann_MIMIC_III_disch.csv.
2 PAPERS • 1 BENCHMARK
Digital Edition: Essays from Hannah Arendt is an NER dataset created from the digital edition "Sechs Essays" by Hannah Arendt. It consists of 23 documents from the period 1932-1976, which are available as TEI files online (see https://hannah-arendt-edition.net/3p.html?lang=de).
Digital Edition: Sturm Edition. Source: Schrade, Torsten: "Startseite", in: DER STURM. Digitale Quellenedition zur Geschichte der internationalen Avantgarde, edited and published by Marjam Trautmann and Torsten Schrade. Mainz, Academy of Sciences and Literature, Version 1 of 16 July 2018.
SumeCzech-NER contains named entity annotations of SumeCzech 1.0, a Czech news-based summarization dataset.
Full-text chemical identification and indexing in PubMed articles.
2 PAPERS • 3 BENCHMARKS
PhoNER_COVID19 is a dataset for recognising COVID-19 related named entities in Vietnamese, consisting of 35K entities over 10K sentences. The authors defined 10 entity types with the aim of extracting key information related to COVID-19 patients, which are especially useful in downstream applications. In general, these entity types can be used in the context of not only the COVID-19 pandemic but also in other future epidemics.
MasakhaNER is a collection of Named Entity Recognition (NER) datasets for 10 different African languages. The languages forming this dataset are: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá.
47 PAPERS • 2 BENCHMARKS
CrossNER is a cross-domain NER (Named Entity Recognition) dataset: a fully labeled collection of NER data spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence), with specialized entity categories for each domain. Additionally, CrossNER includes unlabeled domain-related corpora for the corresponding five domains.
11 PAPERS • 1 BENCHMARK
Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. We publish the NEMO corpus in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO^2)" [1], where we use it in extensive experiments and analyses, showing the importance of morphological boundaries for neural modeling of NER in morphologically rich languages. Code for these models and experiments can be found in the NEMO code repo.
3 PAPERS • 3 BENCHMARKS
AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.
COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.
DaNE adds named entity annotations to the Danish Universal Dependencies treebank (the Danish Dependency Treebank), following the CoNLL-2003 annotation scheme.
5 PAPERS • 5 BENCHMARKS
The Dataset of Legal Documents consists of German court decisions from 2017 and 2018, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG), and Federal Administrative Court (BVerwG).
ScienceExamCER is a collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, with 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary and middle-school science questions.
The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality, and data formats. Manually annotated data is provided for training, development, and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available, and support for linguistic processing is provided in the form of analyses created by various state-of-the-art tools on the dataset texts.
8 PAPERS • 2 BENCHMARKS
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.
143 PAPERS • 2 BENCHMARKS
The Romanian Named Entity Corpus is a named entity corpus for the Romanian language. It contains over 26,000 entities in ~5,000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper and cover several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition.
The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets on scientific articles (SemEval 2017 Task 10 and SemEval 2018 Task 7) by extending entity types, relation types, and relation coverage, and by adding cross-sentence relations using coreference links.
118 PAPERS • 7 BENCHMARKS
The Second HAREM was an evaluation exercise in Portuguese Named Entity Recognition. It aimed to refine text annotation processes, building on the First HAREM. Challenges included adapting the guidelines for new texts and establishing a unified document with directives from both editions.
0 PAPERS • NO BENCHMARKS YET