The Pubmed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
405 PAPERS • 14 BENCHMARKS
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators.
324 PAPERS • 6 BENCHMARKS
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
188 PAPERS • 18 BENCHMARKS
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
176 PAPERS • 4 BENCHMARKS
protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue . positional gene sets are used, motif gene sets and immunological signatures as features and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database . The average graph contains 2373 nodes, with an average degree of 28.8.
150 PAPERS • 2 BENCHMARKS
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
120 PAPERS • 13 BENCHMARKS
In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.
109 PAPERS • 2 BENCHMARKS
NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to learn over time to read the web. NELL has accumulated over 50 million candidate beliefs by reading the web, and it is considering these at different levels of confidence. NELL has high confidence in 2,810,379 of these beliefs.
99 PAPERS • 1 BENCHMARK
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
90 PAPERS • 5 BENCHMARKS
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.
75 PAPERS • 5 BENCHMARKS
Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users can form friendship edge each other. Friendster social network also allows users form a group which other members can then join. The Friendster dataset consist of ground-truth communities (based on user-defined groups) and the social network from induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community.
37 PAPERS • NO BENCHMARKS YET
The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.
32 PAPERS • NO BENCHMARKS YET
WebKB is a dataset that includes web pages from computer science departments of various universities. 4,518 web pages are categorized into 6 imbalanced categories (Student, Faculty, Staff, Department, Course, Project). Additionally there is Other miscellanea category that is not comparable to the rest.
29 PAPERS • 3 BENCHMARKS
BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post translational modifications from major model organism species.
26 PAPERS • 2 BENCHMARKS
Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.
12 PAPERS • NO BENCHMARKS YET
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
8 PAPERS • NO BENCHMARKS YET
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
1 PAPER • 3 BENCHMARKS