74 dataset results for Relation Extraction

TimeBankPT

TimeBankPT is a corpus of Portuguese text with temporal annotations. Its annotation scheme is similar to TimeML. TimeBankPT was created by adapting the English corpus used in the first TempEval challenge to Portuguese.

4 PAPERS • 1 BENCHMARK

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports.

13 PAPERS • 3 BENCHMARKS

New York Times Annotated Corpus

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007, with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.

265 PAPERS • 8 BENCHMARKS

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

18 PAPERS • 4 BENCHMARKS

ACE 2004

ACE 2004 (ACE 2004 Multilingual Training Corpus)

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

46 PAPERS • 5 BENCHMARKS

ACE 2005 (ACE 2005 Multilingual Training Corpus)

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

62 PAPERS • 9 BENCHMARKS

BLUE

BLUE (Biomedical Language Understanding Evaluation)

The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.

123 PAPERS • NO BENCHMARKS YET

ChemProt

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task.

16 PAPERS • 1 BENCHMARK

Chinese Literature NER RE

Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.

1 PAPER • NO BENCHMARKS YET

CoNLL

The CoNLL dataset is a widely used resource in natural language processing (NLP). “CoNLL” stands for the Conference on Computational Natural Language Learning, and the dataset originates from the series of shared tasks organized at that conference.

177 PAPERS • 49 BENCHMARKS

CoNLL04

The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has at least one relation. The sentences are annotated with information about entities and their corresponding relation types.

17 PAPERS • 3 BENCHMARKS

DWIE

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

17 PAPERS • 4 BENCHMARKS

FB1.5M

The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and treats its 30 relations with fewer than 500 triplets as low-resource relations.

1 PAPER • NO BENCHMARKS YET

FB15k-237-low

The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.

3 PAPERS • NO BENCHMARKS YET

FUNSD (Form Understanding in Noisy Scanned Documents)

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.

144 PAPERS • 3 BENCHMARKS

FewRel 2.0

A more challenging task to investigate two aspects of few-shot relation classification models: (1) Can they adapt to a new domain with only a handful of instances? (2) Can they detect none-of-the-above (NOTA) relations?

38 PAPERS • NO BENCHMARKS YET

GAD

GAD (Gene Associations Database)

GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.

5 PAPERS • 1 BENCHMARK

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search, 2,000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes based on a chemical classification. 36 terminal classes were used to annotate the GENIA corpus.

18 PAPERS • 2 BENCHMARKS

LPSC

LPSC (Planetary Science Data Set)

This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .
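The entity types and relation signatures listed above can be written down as a small schema. The following is a minimal Python sketch, not the MTE pipeline's own data model: the class names, the `is_valid` helper, and the example mentions (`rock_A`, `Fe`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Relation signatures as listed in the description:
#   Contains(Target, Element | Mineral) and HasProperty(Target, Property).
# Each entry maps a relation name to (head entity type, allowed tail entity types).
RELATION_SIGNATURES = {
    "Contains": ("Target", {"Element", "Mineral"}),
    "HasProperty": ("Target", {"Property"}),
}


@dataclass(frozen=True)
class Entity:
    text: str  # surface mention (hypothetical field name)
    type: str  # one of: Element, Mineral, Property, Target


@dataclass(frozen=True)
class Relation:
    name: str
    head: Entity
    tail: Entity


def is_valid(rel: Relation) -> bool:
    """Return True if the relation's argument types match its signature."""
    signature = RELATION_SIGNATURES.get(rel.name)
    if signature is None:
        return False
    head_type, tail_types = signature
    return rel.head.type == head_type and rel.tail.type in tail_types


# Made-up example mentions, not taken from the data set.
target = Entity("rock_A", "Target")
element = Entity("Fe", "Element")
print(is_valid(Relation("Contains", target, element)))
print(is_valid(Relation("HasProperty", target, element)))
```

A check like this could be used to filter extracted relations whose argument types do not fit the annotation scheme before loading them into a table.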

1 PAPER • 2 BENCHMARKS

Medical Case Report Corpus

Medical Case Report Corpus is a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library.

1 PAPER • NO BENCHMARKS YET

NYT10-HRL

A dataset from the paper "A Hierarchical Framework for Relation Extraction with Reinforcement Learning".

4 PAPERS • 1 BENCHMARK

PGR

PGR (Phenotype-Gene Relations)

Phenotype-Gene Relations (PGR) is a corpus that consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations.

6 PAPERS • 1 BENCHMARK

Part Whole Relations

The Part-Whole Relations dataset is a dataset of semantic relations between entities. It contains the following subtypes:

- Component-Of
- Member-Of
- Portion-Of
- Stuff-Of
- Located-In
- Contained-In
- Phase-Of
- Participates-In

1 PAPER • NO BENCHMARKS YET

Perlex

Perlex is a Persian dataset for relation extraction. It is an expert-translated version of the SemEval-2010 Task 8 dataset.

4 PAPERS • NO BENCHMARKS YET

T-REx

A dataset of large-scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences).

109 PAPERS • 2 BENCHMARKS

Translated TACRED

533 parallel examples sampled from TACRED and translated into Russian and Korean (plus 3 additional examples in Russian), accompanied by a translation of the list of trigger words collected for the different relations.

1 PAPER • NO BENCHMARKS YET
