Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes that capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them.
12 PAPERS • NO BENCHMARKS YET
WikiGraphs is a dataset of Wikipedia articles, each paired with a knowledge graph, to facilitate research in conditional text generation, graph generation, and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (one or a few sentences), which limits the capabilities of the models that can be learned from the data. WikiGraphs is collected by pairing each Wikipedia article from the established WikiText-103 benchmark with a subgraph from the Freebase knowledge graph. Both the graphs and the text data are significantly larger in scale than those in prior graph-text paired datasets.
3 PAPERS • 1 BENCHMARK
SketchGraphs is a dataset of 15 million sketches extracted from real-world CAD models intended to facilitate research in both ML-aided design and geometric program induction. Each sketch is represented as a geometric constraint graph where edges denote designer-imposed geometric relationships between primitives, the nodes of the graph.
11 PAPERS • NO BENCHMARKS YET
The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents the dependency between two statements.
13 PAPERS • NO BENCHMARKS YET
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed
1 PAPER • NO BENCHMARKS YET
MalNet is a large public graph database, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.
13 PAPERS • 4 BENCHMARKS
GAP is a graph processing benchmark suite with the goal of helping to standardize graph processing evaluations. The benchmark not only specifies graph kernels, input graphs, and evaluation methodologies, but it also provides optimized baseline implementations. Graph framework developers can demonstrate the generality of their programming model by implementing all of the benchmark's kernels and delivering competitive performance on all of the benchmark's graphs. Algorithm designers can use the input graphs and the baseline implementations to demonstrate their contribution. Platform designers and performance analysts can use the suite as a workload representative of graph processing.
48 PAPERS • 1 BENCHMARK
We release 280 synthetic IAM graphs generated using IAM graphs of commercial companies. Specifically, we vary the number of nodes but keep graph density as is, i.e., in the range of 0.259 ± 0.198 (avg ± std). After fixing node counts, we sample with replacement the actual nodes from a real-world graph, which is chosen at random. Then we add Gaussian N(0, 0.01) noise to node embeddings and renormalize them. A synthetic graph generated in such a way is an "upsampled" version of an underlying real-world graph.
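The upsampling procedure for the synthetic IAM graphs can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `upsample_graph`, the representation of each real graph as a list of node embedding vectors, the reading of N(0, 0.01) as a variance of 0.01, and the use of unit L2 norm for renormalization are all assumptions. How edges are carried over is not described in the entry and is left out.

```python
import math
import random


def upsample_graph(real_graphs, n_nodes, noise_var=0.01, rng=None):
    """Build synthetic node embeddings from one randomly chosen real graph.

    Assumptions (not confirmed by the dataset description):
    - each real graph is a list of node embeddings (lists of floats);
    - noise_var is the variance of the Gaussian noise (std = sqrt(noise_var));
    - "renormalize" means rescaling each embedding to unit L2 norm.
    """
    rng = rng or random.Random()
    source = rng.choice(real_graphs)            # real graph chosen at random
    std = math.sqrt(noise_var)
    synthetic = []
    for _ in range(n_nodes):
        base = rng.choice(source)               # node sampling with replacement
        noisy = [x + rng.gauss(0.0, std) for x in base]
        norm = math.sqrt(sum(x * x for x in noisy))
        if norm == 0.0:
            # Degenerate case: keep the unperturbed embedding.
            synthetic.append(list(base))
        else:
            synthetic.append([x / norm for x in noisy])
    return synthetic


# Hypothetical toy input: two "real" graphs with 2-dimensional embeddings.
toy_graphs = [[[1.0, 0.0], [0.0, 1.0]], [[0.6, 0.8], [0.8, 0.6]]]
toy_synthetic = upsample_graph(toy_graphs, n_nodes=10, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps a run reproducible, which is convenient when many synthetic graphs of different sizes are generated from the same pool of real graphs.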
The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long-range dependencies in graphs.

| Dataset | Domain | Task |
|---|---|---|
| PascalVOC-SP | Computer Vision | Node Classification |
| COCO-SP | Computer Vision | Node Classification |
| PCQM-Contact | Quantum Chemistry | Link Prediction |
| Peptides-func | Chemistry | Graph Classification |
| Peptides-struct | Chemistry | Graph Regression |
41 PAPERS • 5 BENCHMARKS
GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and text-to-knowledge graph (T2G) conversion. It is introduced in the paper "GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation" by Zhijing Jin, Qipeng Guo, Xipeng Qiu, and Zheng Zhang at COLING
7 PAPERS • 2 BENCHMARKS
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification. MAG240M-LSC is a heterogeneous academic graph, and the task is to predict the subject areas of papers situated in the heterogeneous graph (node classification). WikiKG90M-LSC is a knowledge graph, and the task is to impute missing triplets (link prediction). PCQM4M-LSC is a quantum chemistry dataset, and the task is to predict an important molecular property, the HOMO-LUMO gap, of a given molecule (graph regression).
31 PAPERS • 3 BENCHMARKS
YoutubeGraph-Dyn is an evolving graph dataset generated from YouTube real-world interactions. It can be used to study temporal evolution on graphs. YoutubeGraph-Dyn provides intra-day time granularity (with 416 snapshots taken every 6 hours for a period of 104 days), multi-modal relationships that capture different aspects of the data, multiple attributes
…Training graph contains 46K entities, 130 relations, 202K triples. Inference graph contains 30K entities, 130 relations, 77K triples. Validation and test triples to predict belong to the inference graph.
1 PAPER • 1 BENCHMARK
…Training graph contains 10K entities, 96 relations, 78K triples. Inference graph contains 7K entities, 96 relations, 21K triples. Validation and test triples to predict belong to the inference graph.
The KACC benchmark consists of three subtasks that can be applied to knowledge graphs: knowledge abstraction, knowledge concretization, and knowledge completion. The knowledge abstraction subtask contains tasks of concept inference, schema prediction, and concept graph completion on the two-view KG. The knowledge concretization subtask requires models to do entity graph completion based on the two subgraphs. The knowledge completion subtask consists of typical single-view knowledge graph completion tasks for each subgraph. KACC contains 999,902 entities in the entity graph, with 691 types of relations. The concept graph contains 21,293 concepts with 198 types of meta-relations. There are 2,367,971 cross-links between the two.
Reddit12k contains 11929 graphs, each corresponding to an online discussion thread where nodes represent users and an edge represents the fact that one of the two users responded to the comment of the other. Each of these 11929 discussion graphs has one of 11 graph labels, representing the category of the community.
24 PAPERS • NO BENCHMARKS YET
Graph Robustness Benchmark (GRB) provides scalable, unified, modular, and reproducible evaluation on the adversarial robustness of graph machine learning models. GRB has elaborated datasets, unified evaluation pipeline, modular coding framework, and reproducible leaderboards, which facilitate the developments of graph adversarial learning, summarizing existing
4 PAPERS • NO BENCHMARKS YET
MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph datasets, these knowledge graphs contain both numerical features and images for all entities as well as entity alignments between pairs of KGs. The three knowledge graphs augmented with numerical features and images are called FB15k, YAGO15k, and DBPEDIA15k.
42 PAPERS • 5 BENCHMARKS
This webgraph is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages while the links are mutual likes between sites. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook.
7 PAPERS • NO BENCHMARKS YET
The IMCPT-SparseGM dataset is a new visual graph matching benchmark addressing partial matching and graphs with larger sizes, based on the novel stereo benchmark Image Matching Challenge PhotoTourism (IMC-PT). This dataset was released with the CVPR 2023 paper "Deep Learning of Partial Graph Matching via Differentiable Top-K".
…In particular, graphs are isomorphic if they have the same degree and the task is to classify non-isomorphic graphs.
29 PAPERS • 2 BENCHMARKS
…A graph corresponds to a researcher’s ego network, i.e., the researcher and their collaborators are nodes and an edge indicates collaboration between two researchers. The dataset has 5,000 graphs and each graph has label 0, 1, or 2.
230 PAPERS • 2 BENCHMARKS
Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology. The four Color-connectivity datasets were created by taking a graph and randomly coloring half of its nodes one color, e.g., red, and the other nodes blue, such that the red nodes either form a single … For the underlying graph topology we used: 1) 16x16 2D grid, 2) 32x32 2D grid, 3) Euroroad road network (Šubelj et al. 2011), and 4) Minnesota road network. We sampled a balanced set of 15,000 coloring examples for each graph, except for the Minnesota network, for which we generated 6,000 examples due to memory constraints. The Color-connectivity task requires a combination of local and long-range graph information processing, to which most existing message-passing Graph Neural Networks (GNNs) do not scale.
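To make the question behind the Color-connectivity task concrete, the sketch below colors half of the cells of a 2D grid red and counts how many connected components the red cells form. It is only an illustration under assumptions: the helper names, the 4-neighbour grid adjacency, and the uniform random coloring are hypothetical, and the dataset's actual sampling and labeling procedure is not reproduced here.

```python
import random
from collections import deque


def random_half_coloring(n_rows, n_cols, rng):
    """Return the set of grid cells colored red (half of all cells)."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return set(rng.sample(cells, len(cells) // 2))


def count_red_components(red):
    """Count connected components among red cells with BFS (4-neighbour grid)."""
    seen = set()
    components = 0
    for start in red:
        if start in seen:
            continue
        components += 1                 # found a component not visited yet
        seen.add(start)
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nxt in red and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components


# Hypothetical usage on a 16x16 grid, one of the topologies named above.
red_cells = random_half_coloring(16, 16, random.Random(0))
n_components = count_red_components(red_cells)
```

A classical traversal answers the question exactly but has to walk across the whole graph, which gives some intuition for why the task involves long-range information.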
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.
34 PAPERS • 9 BENCHMARKS
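For the arXiv HEP-TH citation graph described above, the edge rule can be sketched as follows. The function name and the input format (a list of `(citing, cited)` pairs plus the set of in-dataset paper IDs) are assumptions made for illustration, not the dataset's actual file format.

```python
def build_citation_graph(citations, papers_in_dataset):
    """Return adjacency lists with a directed edge i -> j when paper i cites paper j.

    Citations that involve a paper outside the dataset are dropped, mirroring
    the rule that the graph holds no information about outside papers.
    """
    graph = {paper: [] for paper in papers_in_dataset}
    for citing, cited in citations:
        if citing in graph and cited in graph:
            graph[citing].append(cited)
    return graph


# Hypothetical toy input: "x" and "y" are papers outside the dataset.
toy_citations = [("a", "b"), ("b", "c"), ("a", "x"), ("y", "a")]
toy_graph = build_citation_graph(toy_citations, {"a", "b", "c"})
# toy_graph == {"a": ["b"], "b": ["c"], "c": []}
```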
AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral Screen Database of Active Compounds.
52 PAPERS • 1 BENCHMARK
Worldtree is a corpus of explanation graphs, explanatory role ratings, and an associated tablestore. It provides explanation graphs for 1,680 questions and 4,950 tablestore rows across 62 semi-structured tables. This data is intended to be paired with the AI2 Mercury Licensed questions.
32 PAPERS • NO BENCHMARKS YET
Minesweeper is a synthetic graph emulating the eponymous game.
15 PAPERS • 1 BENCHMARK
REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between two users if at least one of them responds to the other’s comment. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.
137 PAPERS • 2 BENCHMARKS
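The graph construction described for REDDIT-BINARY can be sketched as below. The input format (a list of `(replier, comment_author)` pairs for one thread), the function name, and the choice to ignore self-replies are assumptions for illustration only.

```python
def build_thread_graph(replies):
    """Build an undirected user graph from (replier, comment_author) pairs."""
    nodes = set()
    edges = set()
    for replier, author in replies:
        nodes.update((replier, author))
        if replier != author:                          # ignore self-replies (an assumption)
            edges.add(frozenset((replier, author)))    # one edge regardless of direction
    return nodes, edges


# Hypothetical thread: u1 and u2 reply to each other, u3 replies to u1.
toy_nodes, toy_edges = build_thread_graph([("u1", "u2"), ("u2", "u1"), ("u3", "u1")])
```

Storing each edge as a `frozenset` means that replies in either direction collapse into a single undirected edge, matching the "at least one of them" wording.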
…EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA. MIMIC-SPARQL dataset provides graph-based EHR QA data where natural language queries are converted to SPARQL instead of SQL
2 PAPERS • NO BENCHMARKS YET
This is the set of graphs used in the PACE 2022 challenge for computing the Directed Feedback Vertex Set, from the Exact track. It consists of 200 labelled directed graphs. The graphs range in size from N=512 up to N=131072 vertices, with up to 1315170 edges. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The vertex labels are integers ranging from 1 to N. There is the related PACE 2022 Heuristic dataset, which was for heuristic computation; those graphs are generally larger and denser, as approximate solutions were still accepted. An example graph would be

```
4 4 0
2 3
3
1

```
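A small parser for the PACE example graph can be sketched as below. It assumes that the first line holds three integers (vertex count N, edge count, and a third field that is 0 in the example) and that line i of the following N lines lists the out-neighbours of vertex i, with an empty line for a vertex without outgoing edges. This layout and the function name `parse_dfvs_graph` are assumptions, so check them against the official format description before relying on them.

```python
def parse_dfvs_graph(text):
    """Parse a directed graph into {vertex: [out-neighbours]} with 1-based labels.

    Assumed layout: header "N M T", then one line of out-neighbours per vertex.
    """
    lines = text.split("\n")
    n, m, _ = (int(tok) for tok in lines[0].split())
    adjacency = {}
    for v in range(1, n + 1):
        # A missing or empty line means the vertex has no outgoing edges.
        row = lines[v] if v < len(lines) else ""
        adjacency[v] = [int(tok) for tok in row.split()]
    n_edges = sum(len(out) for out in adjacency.values())
    if n_edges != m:
        raise ValueError(f"header says {m} edges, found {n_edges}")
    return adjacency


example = "4 4 0\n2 3\n3\n1\n"
graph = parse_dfvs_graph(example)
# graph == {1: [2, 3], 2: [3], 3: [1], 4: []}
```

Under this reading the example has the edges 1->2, 1->3, 2->3, and 3->1, so removing vertex 3 (or vertex 1) already breaks the only cycle.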
TUDataset: A collection of benchmark datasets for learning with graphs
68 PAPERS • 1 BENCHMARK
This is the set of graphs used in the PACE 2022 challenge for computing the Directed Feedback Vertex Set, from the Heuristic track. It consists of 200 labelled directed graphs. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The vertex labels are integers ranging from 1 to N. There is the related PACE 2022 Exact dataset, which was for exact computation; those graphs are generally smaller and sparser, as only exact solutions were accepted. An example graph would be

```
4 4 0
2 3
3
1

```
…It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.
- … representing communication
- Citation networks: nodes represent papers, edges represent citations
- Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
- Web graphs …
- … co-purchased products
- Internet networks: nodes represent computers and edges communication
- Road networks: nodes represent intersections and edges roads connecting the intersections
- Autonomous systems: graphs …
- Face-to-face communication networks: networks of face-to-face (non-online) interactions
- Graph classification datasets: disjoint graphs from different classes
151 PAPERS • NO BENCHMARKS YET
KG20C is a knowledge graph about high-quality papers from 20 top computer science conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering.
4 PAPERS • 1 BENCHMARK
…In each graph, nodes represent actors/actresses, and there is an edge between them if they appear in the same movie. These graphs are derived from the Action and Romance genres.
283 PAPERS • 2 BENCHMARKS
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs.
813 PAPERS • 16 BENCHMARKS
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as the ancestor-descendant prediction task proposed in the paper.
…If an author i co-authored a paper with author j, the graph contains an undirected edge between i and j. If the paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes.
10 PAPERS • 2 BENCHMARKS
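The co-authorship rule described above (a paper with k authors contributes a complete subgraph on k nodes) can be sketched as follows. The function name and the input format, a list of author lists with one list per paper, are assumptions for illustration.

```python
from itertools import combinations


def coauthorship_edges(papers):
    """Return the set of undirected edges induced by a list of author lists."""
    edges = set()
    for authors in papers:
        # Every pair of co-authors is linked, so a paper with k distinct
        # authors contributes a clique with k * (k - 1) / 2 edges.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


# Hypothetical toy input: one three-author paper and one two-author paper.
toy_edges = coauthorship_edges([["i", "j", "k"], ["j", "k"]])
# toy_edges == {("i", "j"), ("i", "k"), ("j", "k")}
```

Sorting each author list before pairing gives every undirected edge a single canonical `(a, b)` form, so repeated collaborations do not create duplicate edges.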
Multi-Modal Hate Speech Detection with Graph Context. 18k+ labels, 8k+ discussions, 900k+ comments.
Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.
21 PAPERS • 1 BENCHMARK
…The dataset has been integrated with PyTorch Geometric (PyG) and Deep Graph Library (DGL). You can load the dataset after installing the latest versions of PyG or DGL. The UPFD dataset includes two sets of tree-structured graphs curated for evaluating binary graph classification, graph anomaly detection, and fake/real news detection tasks. The news retweet graphs were originally extracted by FakeNewsNet. Each graph is a hierarchical tree-structured graph where the root node represents the news; the leaf nodes are Twitter users who retweeted the root news. The dataset statistics are shown below: | Data | #Graphs | #Fake News| #Total Nodes | #Total Edges | #Avg.
Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.
This is a Twitter dataset of 100,386 users along with up to 200 tweets from their timelines, collected with a random-walk-based crawler on the retweet graph, with a subsample of 4,972 users that is manually annotated. The dataset can be used to examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph.
3 PAPERS • 2 BENCHMARKS
…It is used to generate flow graphs from procedural texts.
PTC is a collection of 344 chemical compounds represented as graphs, labeled with their carcinogenicity for rats. There are 19 distinct node labels.
101 PAPERS • 1 BENCHMARK
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular
2 PAPERS • 1 BENCHMARK