The Gossipcop variant of the UPFD dataset for benchmarking.
3 PAPERS • 1 BENCHMARK
WikiGraphs is a dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (1 or few sentences), thus limiting the capabilities of the models that can be learned on the data.
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
2 PAPERS • NO BENCHMARKS YET
The DeepNets-1M dataset is composed of neural network architectures represented as graphs where nodes are operations (convolution, pooling, etc.) and edges correspond to the forward pass flow of data through the network. DeepNets-1M has 1 million training architectures and 1402 in-distribution (ID) and out-of-distribution (OOD) evaluation architectures: 500 validation and 500 testing ID architectures, 100 wide OOD architectures, 100 deep OOD architectures, 100 dense OOD architectures, 100 OOD architectures without batch normalization, and 2 predefined architectures (ResNet-50 and 12-layer Visual Transformer).
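The evaluation split sizes quoted above can be tallied to confirm the stated total of 1402. This is only an arithmetic check; the dictionary keys are illustrative labels and not identifiers taken from the dataset.

```python
# Evaluation split sizes as stated in the DeepNets-1M description.
# The key names are illustrative labels, not official split identifiers.
eval_splits = {
    "val_id": 500,
    "test_id": 500,
    "wide_ood": 100,
    "deep_ood": 100,
    "dense_ood": 100,
    "no_bn_ood": 100,
    "predefined": 2,
}

total_eval = sum(eval_splits.values())
print(total_eval)  # 1402
```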
We have characterized 1000 human cancer cell lines and screened them with 100s of compounds. On this website, you will find drug response data and genomic markers of sensitivity.
2 PAPERS • 1 BENCHMARK
The GlassTemp dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of glass transition temperature. The glass transition temperature of the material itself denotes the temperature range over which this glass transition takes place.
This is a Twitter dataset of 100,386 users, each with up to 200 tweets from their timeline, collected with a random-walk-based crawler on the retweet graph. A subsample of 4,972 users is manually annotated as hateful or not through crowdsourcing. The dataset can be used to examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph.
HiAML Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 4.6k CIFAR-10 networks with an accuracy range of [91.11%, 93.44%].
Inception Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 580 CIFAR-10 networks with an accuracy range of [89.08%, 94.03%].
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
Question Answering (QA) is a widely-used framework for developing and evaluating an intelligent machine. In this light, QA on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone toward developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA.
The MarKG dataset has 11,292 entities, 192 relations and 76,424 images, including 2,063 analogy entities and 27 analogy relations. The original intention of MarKG is to provide prior knowledge of analogy entities and relations for better multimodal analogical reasoning.
MetaVD is a Meta Video Dataset for enhancing human action recognition datasets. It provides human-annotated relationship labels between action classes across human action recognition datasets. MetaVD is proposed in the following paper: Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. "MetaVD: A Meta Video Dataset for enhancing human action recognition datasets." Computer Vision and Image Understanding 212 (2021): 103276.
The Nations dataset is a small knowledge graph with 14 entities, 55 relations, and 1992 triples describing countries and their political relationships. This dataset is available for download from https://github.com/ZhenfengLei/KGDatasets.
From Schaub, Michael T., et al. "Random walks on simplicial complexes and the normalized hodge 1-laplacian." SIAM Review 62.2 (2020): 353-391.
This is the dataset used in the PACE 2016 challenge, Track B, which was computing minimal Feedback Vertex Set. This competition focused on exact solutions, i.e. provably minimal feedback vertex sets (and no heuristic solutions). This should not be confused with the PACE 2022 challenge, which focused on directed feedback vertex set, and has its own entries on PapersWithCode (exact and heuristic).
This is the set of graphs used in the PACE 2022 challenge for computing the Directed Feedback Vertex Set, from the Exact track. It consists of 200 labelled directed graphs. The graphs range in size from N=512 up to N=131072 vertices, with up to 1315170 edges. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The vertex labels are integers ranging from 1 to N.
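As a minimal sketch of the problem these instances pose, the helpers below check whether a directed graph is symmetric and whether a candidate vertex set is a valid directed feedback vertex set (removing it leaves no cycle). The function names and the toy graph are made up for illustration, and the check says nothing about minimality, which is what the Exact track actually requires.

```python
from collections import deque

def is_symmetric(edges):
    """Return True if every edge u->v has a matching edge v->u."""
    edge_set = set(edges)
    return all((v, u) in edge_set for (u, v) in edge_set)

def is_feedback_vertex_set(n, edges, removed):
    """Check that deleting `removed` leaves a directed graph with no cycles.

    Vertices are labelled 1..n. Acyclicity is tested with Kahn's
    topological sort on the remaining vertices.
    """
    removed = set(removed)
    keep = [v for v in range(1, n + 1) if v not in removed]
    succ = {v: [] for v in keep}
    indeg = {v: 0 for v in keep}
    for u, v in edges:
        if u in removed or v in removed:
            continue
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in keep if indeg[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Every remaining vertex gets sorted exactly when no cycle survives.
    return visited == len(keep)

# Toy graph (made up for illustration): cycle 1->2->3->1 plus edge 3->4.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(is_symmetric(toy_edges))                    # False
print(is_feedback_vertex_set(4, toy_edges, {3}))  # True
print(is_feedback_vertex_set(4, toy_edges, {4}))  # False
```

Removing vertex 3 breaks the only cycle, whereas removing vertex 4 leaves it intact.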
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).
PointPattern is a graph classification dataset constructed by simple point patterns from statistical mechanics. The authors simulated three point patterns in 2D: hard disks in equilibrium (HD), Poisson point process, and random sequential adsorption (RSA) of disks. The HD and Poisson distributions can be seen as simple models that describe the microstructures of liquids and gases while the RSA is a nonequilibrium stochastic process that introduces new particles one by one subject to nonoverlapping conditions.
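To make the RSA description concrete, here is a minimal sketch of random sequential adsorption in the unit square, next to a set of independent uniform points for contrast. The box size, the lack of periodic boundaries, the fixed attempt budget and all function names are assumptions for illustration, not the authors' simulation settings. The uniform sampler draws a fixed number of points, which corresponds to a Poisson point process conditioned on its point count.

```python
import random

def rsa_disks(radius, max_attempts, seed=0):
    """Random sequential adsorption of equal disks in the unit square.

    Disk centers are proposed uniformly at random, one at a time. A
    proposal is kept only if the new disk overlaps no accepted disk.
    """
    rng = random.Random(seed)
    centers = []
    min_dist_sq = (2 * radius) ** 2
    for _ in range(max_attempts):
        x, y = rng.random(), rng.random()
        if all((x - cx) ** 2 + (y - cy) ** 2 >= min_dist_sq
               for cx, cy in centers):
            centers.append((x, y))
    return centers

def uniform_points(n, seed=0):
    """n independent uniform points: no exclusion between points."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

rsa = rsa_disks(radius=0.05, max_attempts=2000, seed=1)
uni = uniform_points(len(rsa), seed=1)
print(len(rsa), len(uni))
```

By construction, no two RSA centers are closer than one disk diameter, while the uniform points carry no such constraint.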
Rent3D++ is an extension of the Rent3D floorplans + photos dataset. The floorplans are annotated with room outline polygons, doors/windows as line segments, object-icons as axis-aligned bounding boxes, room-door-room connectivity graphs, and photo-room assignments. We have extracted rectified surface crops from architectural surfaces in photos, and these can drive interior texturing/material modeling tasks. This dataset can be used with our paper Plan2Scene to generate textured 3D mesh models of houses using floorplans and photos.
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and Matlab Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
The Toulouse Road Network dataset describes patches of road maps from the city of Toulouse, represented both as spatial graphs G = (A, X) and as grayscale segmentation images.
The dataset includes two parts corresponding to the cities of Abakan (65524 nodes, 340012 edges) and Omsk (231688 nodes, 1149492 edges). Along with the road network graph, it includes trip records represented as sequences of visited nodes (making the dataset suitable both for path-blind and path-aware settings). There are two types of target values for a regression task: real travel time and real length of a trip.
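The difference between the path-blind and path-aware settings can be sketched as follows. The on-disk format of the dataset is not described here, so the edge-length dictionary, the node ids and the function names are all hypothetical.

```python
def trip_length(path, edge_length):
    """Path-aware view: sum edge lengths along consecutive trip nodes."""
    return sum(edge_length[(u, v)] for u, v in zip(path, path[1:]))

def endpoints(path):
    """Path-blind view: only the start and end nodes are available."""
    return path[0], path[-1]

# Toy road graph (made up): directed edges with lengths in metres.
edge_length = {(0, 1): 120.0, (1, 2): 80.0, (2, 3): 200.0}
trip = [0, 1, 2, 3]

print(trip_length(trip, edge_length))  # 400.0
print(endpoints(trip))                 # (0, 3)
```

A path-aware model can use the full node sequence, for example to recover the trip length exactly, while a path-blind model must predict the target from the endpoints alone.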
Two-Path Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 6.9k CIFAR-10 networks with an accuracy range of [85.53%, 92.34%].
The PolitiFact variant of the UPFD dataset for benchmarking.
VirtualHome2KG is a system for constructing and augmenting knowledge graphs (KGs) of daily living activities using virtual space. We also provide an ontology to describe the structure of the KGs. We used VirtualHome as a platform of virtual space simulation. Thus, this repository is an extension of the virtualhome. Please see the original repository of the virtualhome for details of the Unity simulation.
Wyze Rule Recommendation Dataset. It is a big dataset with 300,000 users. Please cite [1] if you used the dataset and cite [2] if you referenced the algorithm.
This package provides utilities for generation, filtering, solving, visualizing, and processing of mazes for training ML systems. Primarily built for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498
Dataset of low fidelity resolutions of the RANS equations over airfoils.
1 PAPER • NO BENCHMARKS YET
AutoFR Dataset is broken down by each site that we crawl within a zip file. It contains multiple different experiments that we conducted in our paper. The overall dataset contains 1042 sites that we crawled where we detected ads within the Top-5K.
BeGin provides 23 benchmark scenarios for graphs from 14 real-world datasets, which cover 12 combinations of the incremental settings and the levels of problem. In addition, BeGin provides various basic evaluation metrics for measuring performance and final evaluation metrics designed for continual learning.
The original paper contains a high-level explanation of the dataset characteristics, and potential use cases of the dataset. ArchABM can help to quantify the impact of some of these building- and company policy-related measures.
The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures. The crystal structures used in CHILI-100K are obtained from a curated subset of the Crystallography Open Database (COD), and the dataset has a broad chemical scope covering database entries for 68 metals and 11 non-metals.
1 PAPER • 8 BENCHMARKS
The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types. This dataset has a narrow chemical scope focused on an interesting part of chemical space with a lot of active research.
This repository includes the experiment results, source code, and test data for Three Cs risk inference, using the CIRO (COVID-19 Infection Risk Ontology) and HermiT.
CTFW is a large annotated procedural text dataset in the cybersecurity domain (3154 documents). It is used to generate flow graphs from procedural texts.
Classifying all cells in an organ is a relevant and difficult problem from plant developmental biology. We here abstract the problem into a new benchmark for node classification in a geo-referenced graph. Solving it requires learning the spatial layout of the organ including symmetries. To allow the convenient testing of new geometrical learning methods, the benchmark of Arabidopsis thaliana ovules is made available as a PyTorch data loader, along with a large number of precomputed features.
1 PAPER • 1 BENCHMARK
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
Clickable heat-map visualizations of the experiments run to quantify the Classic ECN AQM problem and to evaluate the success of the Classic AQM Detection and Fall-back algorithm.
Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology.
DPB-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment tasks. DPB-5L (Spanish) is the subset of DPB-5L containing the Spanish KG.
Main dataset: city_pollution_data.csv
DPPIN is a collection of dynamic networks, which consists of twelve generated dynamic protein-protein interaction networks of yeast cells, stored in twelve folders.
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total of 91 872 labelled images for fire risk assessment. This remote sensing dataset is labelled with the fire risk classes supplied by the Wildfire Hazard Potential (WHP) raster dataset, and remote sensing images are collected using the National Agriculture Imagery Program (NAIP), a high-resolution remote sensing imagery program. On FireRisk, we present benchmark performance for supervised and self-supervised representations, with Masked Autoencoders (MAE) pre-trained on ImageNet1k achieving the highest classification accuracy, 65.29%.
This repository is an extension of GEval. This repository contains a (software) evaluation framework to perform evaluation and comparison on RDF-star graph embedding techniques. The gold standard datasets for evaluation were created from KGRC-RDF-star. Please see here.
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89,127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as the ancestor-descendant prediction task proposed in the paper.
Genre annotations for movies. The file genre2movies.csv contains genre-movie tuples based on Wikidata annotations (https://www.wikidata.org/).