The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles producd by Reuters in 1996-1997. It contains 804,414 manually labeled newswire documents, and categorized with respect to three controlled vocabularies: industries, topics and regions.
320 PAPERS • 6 BENCHMARKS
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
265 PAPERS • 8 BENCHMARKS
Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.
48 PAPERS • 4 BENCHMARKS
Hierarchical multi-label classification dataset for functional genomics
2 PAPERS • 1 BENCHMARK
Hierarchical-multilabel classification dataset for functional genomics
SciHTC is a dataset for hierarchical multi-label text classification (HMLTC) of scientific papers which contains 186,160 papers and 1,233 categories from the ACM CCS tree.
1 PAPER • NO BENCHMARKS YET