The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles producd by Reuters in 1996-1997. It contains 804,414 manually labeled newswire documents, and categorized with respect to three controlled vocabularies: industries, topics and regions.
334 PAPERS • 6 BENCHMARKS
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
261 PAPERS • 9 BENCHMARKS
Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.
56 PAPERS • 4 BENCHMARKS
EURLEX57K is a new publicly available legal LMTC dataset, dubbed EURLEX57K, containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ∼4.3k labels (concepts) from the European Vocabulary (EUROVOC).
23 PAPERS • 2 BENCHMARKS
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, we present the BIOSCAN-5M Insect dataset to the machine learning community. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and specimen size.
3 PAPERS • NO BENCHMARKS YET
Hierarchical multi-label classification dataset for functional genomics
2 PAPERS • 1 BENCHMARK
Hierarchical-multilabel classification dataset for functional genomics
SciHTC is a dataset for hierarchical multi-label text classification (HMLTC) of scientific papers which contains 186,160 papers and 1,233 categories from the ACM CCS tree.
2 PAPERS • NO BENCHMARKS YET
The WOS Hierarchical Text Classification are three dataset variants created from Web of Science (WOS) title and abstract data categorised into a hierarchical, multi-label class structure. The aim of the sampling and filtering methodology used was to create well-balanced class distributions (at chosen hierarchical levels). Furthermore, the WOS_JTF variant was also created with the aim to only contain publication data such that their class assignments results is classes instances that semantically more similar.
1 PAPER • NO BENCHMARKS YET