The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
192 PAPERS • 18 BENCHMARKS
The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
182 PAPERS • 3 BENCHMARKS
IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies in IMDB. A node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-MULTI, the edges are collected from three different genres: Comedy, Romance and Sci-Fi.
98 PAPERS • 2 BENCHMARKS
The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.
36 PAPERS • 3 BENCHMARKS
Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.
14 PAPERS • 3 BENCHMARKS
SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks.
3 PAPERS • 2 BENCHMARKS
Wikipedia Title is a dataset for learning character-level compositionality from the character visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.
3 PAPERS • NO BENCHMARKS YET
Introduces three datasets of expressing hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively.
2 PAPERS • NO BENCHMARKS YET
RTC is a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. The Reddit Time Corpus (RTC) covers three years between March 2017 and February 2020 and is split into 36 evenly-sized monthly subsets based on comment timestamps. RTC is sampled from the Pushshift Reddit dataset.
1 PAPER • NO BENCHMARKS YET