TimeBankPT is a corpus of Portuguese text with annotations about time. The annotation scheme used is similar to TimeML. TimeBankPT is the result of adapting the English corpus used in the first TempEval challenge to the Portuguese language.
4 PAPERS • 1 BENCHMARK
Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports.
13 PAPERS • 3 BENCHMARKS
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
265 PAPERS • 8 BENCHMARKS
2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.
18 PAPERS • 4 BENCHMARKS
ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.
46 PAPERS • 5 BENCHMARKS
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
62 PAPERS • 9 BENCHMARKS
The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.
123 PAPERS • NO BENCHMARKS YET
ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task.
16 PAPERS • 1 BENCHMARK
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
1 PAPER • NO BENCHMARKS YET
The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Natural Language Learning. It originates from a series of shared tasks organized at the Conferences of Natural Language Learning.
177 PAPERS • 49 BENCHMARKS
The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has at least one relation. The sentences are annotated with information about entities and their corresponding relation types.
17 PAPERS • 3 BENCHMARKS
The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.
17 PAPERS • 4 BENCHMARKS
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.
3 PAPERS • NO BENCHMARKS YET
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
144 PAPERS • 3 BENCHMARKS
A more challenging task to investigate two aspects of few-shot relation classification models: (1) Can they adapt to a new domain with only a handful of instances? (2) Can they detect none-of-the-above (NOTA) relations?
38 PAPERS • NO BENCHMARKS YET
GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.
5 PAPERS • 1 BENCHMARK
JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search 2,000 abstracts were selected and hand annotated according to a small taxonomy of 48 classes based on a chemical classification. 36 terminal classes were used to annotate the GENIA corpus.
18 PAPERS • 2 BENCHMARKS
This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .
1 PAPER • 2 BENCHMARKS
Medical Case Report Corpus is a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library.
a dataset from A Hierarchical Framework for Relation Extraction with Reinforcement Learning
Phenotype-Gene Relations (PGR) is a corpus that consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations.
6 PAPERS • 1 BENCHMARK
The Part-Whole Relations dataset is a dataset of semantic relations between entities. It contains the following subtypes: - Component-Of - Member-Of - Portion-Of - Stuff-Of - Located-In - Contained-In - Phase-Of - Participates-In
Persian dataset for relation extraction, which is an expert-translated version of the "Semeval-2010-Task-8" dataset.
4 PAPERS • NO BENCHMARKS YET
A dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences).
109 PAPERS • 2 BENCHMARKS
533 parallel examples sampled from TACRED, translated into Russian and Korean (and 3 additional examples in Russian), accompanied with tranlsation of a list of trigger words collected for the different relations.