🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

63 dataset results for Link Prediction

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.

1,577 PAPERS • 11 BENCHMARKS

MovieLens

The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time. These preferences were entered by way of the MovieLens web site1 — a recommender system that asks its users to give movie ratings in order to receive personalized movie recommendations.

1,092 PAPERS • 16 BENCHMARKS

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

1,063 PAPERS • 24 BENCHMARKS

OGB (Open Graph Benchmark)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.

808 PAPERS • 16 BENCHMARKS

FB15k (Freebase 15K)

The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of 592,213 triplets with 14,951 entities and 1,345 relationships. FB15K-237 is a variant of the original dataset where inverse relations are removed, since it was found that a large number of test triplets could be obtained by inverting triplets in the training set.

565 PAPERS • 9 BENCHMARKS

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

492 PAPERS • 20 BENCHMARKS

WN18

WN18 (WordNet18)

The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found out that a large number of the test triplets can be found in the training set with another relation or the inverse relation. Therefore, a new version of the dataset WN18RR has been proposed to address this issue.

430 PAPERS • 5 BENCHMARKS

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, many triples are inverses that cause leakage from the training to testing and validation splits. FB15k-237 was created by Toutanova and Chen (2015) to ensure that the testing and evaluation datasets do not have inverse relation test leakage. In summary, FB15k-237 dataset contains 310,116 triples with 14,541 entities and 237 relation types.

402 PAPERS • 3 BENCHMARKS

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many text triples are obtained by inverting triples from the training set. Thus the WN18RR dataset is created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.

338 PAPERS • 3 BENCHMARKS

YAGO (Yet Another Great Ontology)

Yet Another Great Ontology (YAGO) is a Knowledge Graph that augments WordNet with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource to a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2 grounded entities, facts, and events in time and space, contained 446 million facts about 9.8 million entities, while YAGO3 added about 1 million more entities from non-English Wikipedia articles. YAGO3-10 a subset of YAGO3, containing entities which have a minimum of 10 relations each.

335 PAPERS • 7 BENCHMARKS

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.

314 PAPERS • 14 BENCHMARKS

PPI

PPI (Protein-Protein Interactions (PPI))

protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue [41]. positional gene sets are used, motif gene sets and immunological signatures as features and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database [34]. The average graph contains 2373 nodes, with an average degree of 28.8.

285 PAPERS • 2 BENCHMARKS

COLLAB

COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators are nodes and an edge indicates collaboration between two researchers. A researcher’s ego network has three possible labels, i.e., High Energy Physics, Condensed Matter Physics, and Astro Physics, which are the fields that the researcher belongs to. The dataset has 5,000 graphs and each graph has label 0, 1, or 2.

230 PAPERS • 2 BENCHMARKS

DBLP

DBLP (Citation Network Dataset)

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.

205 PAPERS • 5 BENCHMARKS

NELL (Never Ending Language Learning)

NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to learn over time to read the web. NELL has accumulated over 50 million candidate beliefs by reading the web, and it is considering these at different levels of confidence. NELL has high confidence in 2,810,379 of these beliefs.

166 PAPERS • 4 BENCHMARKS

SNAP (Stanford Large Network Dataset Collection)

SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.

150 PAPERS • NO BENCHMARKS YET

NELL-995

NELL-995 KG Completion Dataset

106 PAPERS • 2 BENCHMARKS

ICEWS

A repository that contains political events with a specific timestamp. These political events relate entities (e.g. countries, presidents...) to a number of other entities via logical predicates (e.g. ’Make a visit’ or ’Express intent to meet or negotiate’).

97 PAPERS • 2 BENCHMARKS

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the yelp challenge. Wherein local businesses like restaurants and bars are viewed as items. We use the same 10-core setting in order to ensure data quality.

93 PAPERS • 2 BENCHMARKS

Orkut

Orkut is a social network dataset consisting of friendship social network and ground-truth communities from Orkut.com on-line social network where users form friendship each other.

79 PAPERS • NO BENCHMARKS YET

Douban

Douban (Douban Conversation Corpus)

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based chatbot. The statistics of Douban Conversation Corpus are shown in the following table.

77 PAPERS • 4 BENCHMARKS

Wiki-CS

Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.

73 PAPERS • 2 BENCHMARKS

AMiner

The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.

72 PAPERS • 1 BENCHMARK

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world data related to businesses, reviews, and user interactions. Here are the key details about the Yelp Dataset: Reviews: A whopping 6,990,280 reviews from users. Businesses: Information on 150,346 businesses. Pictures: A collection of 200,100 pictures. Metropolitan Areas: Data from 11 metropolitan areas. Tips: Over 908,915 tips provided by 1,987,897 users. Business Attributes: Details like hours, parking availability, and ambiance for more than 1.2 million businesses. Aggregated Check-ins: Historical check-in data for each of the 131,930 businesses.

68 PAPERS • 21 BENCHMARKS

Wikidata5M

Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

46 PAPERS • 1 BENCHMARK

Long Range Graph Benchmark (LRGB)

The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.

41 PAPERS • 5 BENCHMARKS

MMKG

MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph datasets, these knowledge graphs contain both numerical features and images for all entities as well as entity alignments between pairs of KGs. While MMKG is intended to perform relational reasoning across different entities and images, previous resources are intended to perform visual reasoning within the same image.

41 PAPERS • 5 BENCHMARKS

STRING

STRING is a collection of protein-protein interaction (PPI) networks.

34 PAPERS • NO BENCHMARKS YET

BioGRID

BioGRID (Biological General Repository for Interaction Datasets)

BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post translational modifications from major model organism species.

33 PAPERS • 2 BENCHMARKS

Decagon (Bio-decagon)

Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer multimodal graph/network of two node types: drugs and proteins. Protein-protein interaction network describes relationships between proteins. Drug-drug interaction network contains 964 different types of edges (one for each side effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.

31 PAPERS • 1 BENCHMARK

OGB-LSC (OGB Large-Scale Challenge)

OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.

31 PAPERS • 3 BENCHMARKS

Email-EU

EmailEU is a directed temporal network constructed from email exchanges in a large European research institution for a 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground truth departments in the dataset.

30 PAPERS • NO BENCHMARKS YET

YAGO3-10

YAGO3-10 (Yet Another Great Ontology 3-10)

YAGO3-10 is benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of YAGO) that contains entities associated with at least ten different relations. In total, YAGO3-10 has 123,182 entities and 37 relations and 1,179,040 triples, and most of the triples describe attributes of persons such as citizenship, gender, and profession.

20 PAPERS • 1 BENCHMARK

UMLS

UMLS (Unified Medical Language System)

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records. Here are the key aspects of the UMLS:

18 PAPERS • 1 BENCHMARK

CDCP

CDCP (Cornell eRulemaking Corpus)

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4931 elementary unit and 1221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters, specifying which types of support for which propositions can be added to construct better-formed arguments.

11 PAPERS • 3 BENCHMARKS

FB122

FB122 (Freebase-122)

10 PAPERS • 1 BENCHMARK

arXiv Astro-Ph

Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to Astro Physics category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.

10 PAPERS • 2 BENCHMARKS

Wiki

Wiki (Web Traffic Time Series Forecasting)

Context There's a story behind every dataset and here's your opportunity to share yours.

7 PAPERS • 3 BENCHMARKS

TSP/HCP Benchmark set

This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. In particular, it focuses on small instances which prove to be challenging for one or more state-of-the-art TSP algorithms. These instances are based on difficult instances of Hamiltonian cycle problem (HCP). This includes instances from literature, specially modified randomly generated instances, and instances arising from the conversion of other difficult problems to HCP.

5 PAPERS • 1 BENCHMARK

ACM

ACM (Association for Computing Machinery Active Contour Model algebraic collective model and-Compare Module Active Contour Models)

The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and are divided into three classes (Database, Wireless Communication, Data Mining). An heterogeneous graph is constructed, which comprises 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to elements of a bag-of-words represented of keywords.

4 PAPERS • 1 BENCHMARK

CoDEx Medium

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.

4 PAPERS • 1 BENCHMARK

IS-A

The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.

4 PAPERS • NO BENCHMARKS YET

KG20C

KG20C (A scholarly knowledge graph benchmark dataset)

KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering .

4 PAPERS • 1 BENCHMARK

ProteinKG25

ProteinKG25 is a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and proteins entities. ProteinKG25 contains 4,990,097 triplets (4,879,951 Protein-GO triplets and 110,146 GO-GO triplets), 612,483 entities (565,254 proteins and 47,229 GO terms) and 31 relations.

4 PAPERS • NO BENCHMARKS YET

Arxiv GR-QC

Arxiv GR-QC (General Relativity and Quantum Cosmology collaboration network)

Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to General Relativity and Quantum Cosmology category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.

3 PAPERS • 2 BENCHMARKS

CoDEx Large

3 PAPERS • 1 BENCHMARK

CoDEx Small

3 PAPERS • 1 BENCHMARK

FB15k-237-low

The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.

3 PAPERS • NO BENCHMARKS YET