The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
17 PAPERS • 5 BENCHMARKS
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
16 PAPERS • 9 BENCHMARKS
Fashion 144K is a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information.
11 PAPERS • NO BENCHMARKS YET
A large-scale curated dataset of over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st to April 4th at the time of writing.
8 PAPERS • 5 BENCHMARKS
OPOSUM is a dataset for the training and evaluation of Opinion Summarization models which contains Amazon reviews from six product domains: Laptop Bags, Bluetooth Headsets, Boots, Keyboards, Televisions, and Vacuums. The six training collections were created by downsampling from the Amazon Product Dataset introduced in McAuley et al. (2015) and contain reviews and their respective ratings.
5 PAPERS • NO BENCHMARKS YET