The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
590 PAPERS • 7 BENCHMARKS
The Pubmed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
405 PAPERS • 14 BENCHMARKS
The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of 592,213 triplets with 14,951 entities and 1,345 relationships. FB15K-237 is a variant of the original dataset where inverse relations are removed, since it was found that a large number of test triplets could be obtained by inverting triplets in the training set.
259 PAPERS • 7 BENCHMARKS
Yet Another Great Ontology (YAGO) is a Knowledge Graph that augments WordNet with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource to a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2 grounded entities, facts, and events in time and space, contained 446 million facts about 9.8 million entities, while YAGO3 added about 1 million more entities from non-English Wikipedia articles. YAGO3-10 a subset of YAGO3, containing entities which have a minimum of 10 relations each.
204 PAPERS • 7 BENCHMARKS
The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found out that a large number of the test triplets can be found in the training set with another relation or the inverse relation. Therefore, a new version of the dataset WN18RR has been proposed to address this issue.
201 PAPERS • 3 BENCHMARKS
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
188 PAPERS • 18 BENCHMARKS
protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue . positional gene sets are used, motif gene sets and immunological signatures as features and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database . The average graph contains 2373 nodes, with an average degree of 28.8.
150 PAPERS • 2 BENCHMARKS
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many text triples are obtained by inverting triples from the training set. Thus the WN18RR dataset is created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.
149 PAPERS • 1 BENCHMARK
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
120 PAPERS • 13 BENCHMARKS
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
90 PAPERS • 5 BENCHMARKS
SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more. Image Source: https://snap.stanford.edu/data/
80 PAPERS • NO BENCHMARKS YET
Orkut is a social network dataset consisting of friendship social network and ground-truth communities from Orkut.com on-line social network where users form friendship each other.
52 PAPERS • NO BENCHMARKS YET
The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.
32 PAPERS • NO BENCHMARKS YET
BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post translational modifications from major model organism species.
26 PAPERS • 2 BENCHMARKS
STRING is a collection of protein-protein interaction (PPI) networks.
21 PAPERS • NO BENCHMARKS YET
Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer multimodal graph/network of two node types: drugs and proteins. Protein-protein interaction network describes relationships between proteins. Drug-drug interaction network contains 964 different types of edges (one for each side effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.
14 PAPERS • 1 BENCHMARK
EmailEU is a directed temporal network constructed from email exchanges in a large European research institution for a 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground truth departments in the dataset.
12 PAPERS • NO BENCHMARKS YET
Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to Astro Physics category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
9 PAPERS • 2 BENCHMARKS
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
8 PAPERS • NO BENCHMARKS YET
Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to General Relativity and Quantum Cosmology category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
3 PAPERS • 2 BENCHMARKS
The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4931 elementary unit and 1221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters, specifying which types of support for which propositions can be added to construct better-formed arguments.
3 PAPERS • 3 BENCHMARKS
CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.
2 PAPERS • NO BENCHMARKS YET
The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of the Corpus has been annotated by three annotators by providing the following layers of annotations, each one characterizing a core aspect of scientific publications:
2 PAPERS • 2 BENCHMARKS
The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.
2 PAPERS • NO BENCHMARKS YET
MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph datasets, these knowledge graphs contain both numerical features and images for all entities as well as entity alignments between pairs of KGs. While MMKG is intended to perform relational reasoning across different entities and images, previous resources are intended to perform visual reasoning within the same image.
2 PAPERS • NO BENCHMARKS YET
The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated with argument components and argumentative relations.
1 PAPER • 2 BENCHMARKS
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
1 PAPER • NO BENCHMARKS YET
The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.
1 PAPER • NO BENCHMARKS YET
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
1 PAPER • 3 BENCHMARKS
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
1 PAPER • NO BENCHMARKS YET
The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and are divided into three classes (Database, Wireless Communication, Data Mining). An heterogeneous graph is constructed, which comprises 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to elements of a bag-of-words represented of keywords.
0 PAPER • 1 BENCHMARK