The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
543 PAPERS • 20 BENCHMARKS
The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
308 PAPERS • 3 BENCHMARKS
IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies in IMDB. A node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-MULTI, the edges are collected from three different genres: Comedy, Romance and Sci-Fi.
238 PAPERS • 2 BENCHMARKS
The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world data related to businesses, reviews, and user interactions. Here are the key details about the Yelp Dataset: Reviews: A whopping 6,990,280 reviews from users. Businesses: Information on 150,346 businesses. Pictures: A collection of 200,100 pictures. Metropolitan Areas: Data from 11 metropolitan areas. Tips: Over 908,915 tips provided by 1,987,897 users. Business Attributes: Details like hours, parking availability, and ambiance for more than 1.2 million businesses. Aggregated Check-ins: Historical check-in data for each of the 131,930 businesses.
73 PAPERS • 21 BENCHMARKS
The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.
64 PAPERS • 6 BENCHMARKS
Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.
52 PAPERS • 4 BENCHMARKS
SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks.
46 PAPERS • 4 BENCHMARKS
The Hallmarks of Cancer (*HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus.
33 PAPERS • 1 BENCHMARK
MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.
11 PAPERS • NO BENCHMARKS YET
Introduces three datasets of expressing hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively.
6 PAPERS • NO BENCHMARKS YET
Wikipedia Title is a dataset for learning character-level compositionality from the character visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.
3 PAPERS • NO BENCHMARKS YET
RTC is a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. The Reddit Time Corpus (RTC) covers three years between March 2017 and February 2020 and is split into 36 evenly-sized monthly subsets based on comment timestamps. RTC is sampled from the Pushshift Reddit dataset.
2 PAPERS • NO BENCHMARKS YET
Contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected from the MEDLINE database.
1 PAPER • NO BENCHMARKS YET
RVL-CDIP_MP is our first contribution to retrieve the original documents of the IIT-CDIP test collection which were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which explains that we have around 500 fewer instances. By leveraging metadata from OCR-IDL , we matched the original identifiers from IIT-CDIP and retrieved them from IDL using a conversion.
RVL-CDIP_MP-N can serve its original goal as a covariate shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and Web Search.
Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, citations, and writings. The paper table features attributes such as year, title, and abstract, while the author table includes name and affiliation details. Additionally, some feature completion has been performed for the papers. The dataset also defines a standard classification task for predicting the conference to which a paper belongs.
1 PAPER • 1 BENCHMARK