The FIGER dataset is an entity recognition dataset in which entities are labelled with a fine-grained system of 112 tags, such as person/doctor, art/written_work and building/hotel. The tags are derived from Freebase types. The training set consists of Wikipedia articles automatically annotated with a distant supervision approach that uses the information encoded in anchor links. The test set was annotated manually.
88 PAPERS • 2 BENCHMARKS
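The anchor-link idea behind the FIGER training set can be illustrated with a minimal sketch. It is not the authors' pipeline: the `[[Target|surface]]` link syntax, the `PAGE_TAGS` lookup and the `annotate` function are assumptions made for illustration, and the page-to-tag entries are invented rather than taken from Freebase.

```python
import re

# Hypothetical lookup from a linked page title to fine-grained tags.
# In FIGER the tags are derived from Freebase types; these entries are
# invented for illustration only.
PAGE_TAGS = {
    "Arthur Conan Doyle": ["person/doctor"],
    "A Study in Scarlet": ["art/written_work"],
}

# Matches [[Target]] or [[Target|surface text]] (assumed link syntax).
ANCHOR = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def annotate(wikitext):
    """Strip anchor markup and return (plain_text, mentions).

    Each mention is (start, end, tags), with character offsets into
    plain_text. Links whose target has no known tags are kept as plain
    text but produce no mention.
    """
    parts = []
    mentions = []
    cursor = 0  # position in the marked-up input
    offset = 0  # length of the plain text built so far
    for match in ANCHOR.finditer(wikitext):
        before = wikitext[cursor:match.start()]
        parts.append(before)
        offset += len(before)

        target = match.group(1).strip()
        surface = match.group(2) or match.group(1)
        tags = PAGE_TAGS.get(target)
        if tags:
            mentions.append((offset, offset + len(surface), tags))

        parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    parts.append(wikitext[cursor:])
    return "".join(parts), mentions


text, mentions = annotate(
    "[[Arthur Conan Doyle|Doyle]] wrote [[A Study in Scarlet]] in [[1886]]."
)
for start, end, tags in mentions:
    print(text[start:end], tags)
```

Labels produced this way are only as reliable as the links and the type lookup, which is consistent with the test set being annotated manually.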
OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
61 PAPERS • 11 BENCHMARKS
AIDA CoNLL-YAGO contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase mid.
57 PAPERS • 3 BENCHMARKS
Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset containing 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
31 PAPERS • 3 BENCHMARKS
The Open Entity dataset is a collection of about 6,000 sentences with fine-grained entity type annotations. The entity types are free-form noun phrases that describe appropriate types for the role the target entity plays in the sentence. Sentences were sampled from Gigaword, OntoNotes and web articles. On average, each sentence has 5 labels.
30 PAPERS • 2 BENCHMARKS
A large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separated analysis of mention detection and mention clustering.
14 PAPERS • NO BENCHMARKS YET
A dataset for fine-grained entity typing of knowledge graph entities built from Freebase. It can be used to evaluate entity representations and also mention-level entity typing.
8 PAPERS • NO BENCHMARKS YET
GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:
7 PAPERS • 1 BENCHMARK
WikiSRS is a novel dataset of similarity and relatedness judgments of paired Wikipedia entities (people, places, and organizations), as assigned by Amazon Mechanical Turk workers.
2 PAPERS • NO BENCHMARKS YET