New York Times Annotated Corpus

Introduced by Sandhaus et al. in The New York Times Annotated Corpus

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

Over 1.8 million articles (excluding wire services articles that appeared during the covered period).
Over 650,000 article summaries written by library scientists.
Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com. As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".

Source: https://catalog.ldc.upenn.edu/LDC2008T19

Homepage

Benchmarks

Add a new result Link an existing benchmark

Task	Dataset Variant	Best Model
Relation Extraction	NYT	UniRel
Relationship Extraction (Distant Supervised)	New York Times Corpus	KGPOOL
Joint Entity and Relation Extraction	NYT	SPN
Open Information Extraction	NYT	Deepstruct zero-shot
Document Dating	NYT	NeuralDater
Relationship Extraction (Distant Supervised)	NYT	DocDS
Topic Models	NYT	JoSH