243 dataset results for Graphs

RARE (Randomized AMRs with Rewired Edges)

RARE consists of English AMR pairs with similarity scores that reflect the structural differences between them.

5 PAPERS • 1 BENCHMARK

Synthetic Dynamic Networks (from Aging, Fitness Preferential Attachment mechanisms)

This dataset accompanies the paper "Learning the mechanisms of network growth" by the same authors. The dataset contains 6733 networks of size 20,000, each generated according to a different combination of three mechanisms: fitness, aging, and preferential attachment. The goal is to use machine learning to identify the combination of mechanisms that was used to create the network. The dataset includes static features from the literature and two versions of our newly developed dynamic features.

1 PAPER • 1 BENCHMARK

SARDet-100K

The SARDet-100K dataset encompasses a total of 116,598 images and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. The SARDet-100K dataset stands as the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). The scale and diversity of the SARDet-100K dataset provide researchers with robust training and evaluation for advancing SAR object detection algorithms and techniques, fostering the development of SOTA models in this domain.

10 PAPERS • 1 BENCHMARK

CHILI-100K

The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures. The crystal structures used in CHILI-100K are obtained from a curated subset of the Crystallography Open Database (COD) and have a broad chemical scope covering database entries for 68 metals and 11 non-metals.

1 PAPER • 8 BENCHMARKS

CHILI-3K

The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types. This dataset has a narrow chemical scope focused on an interesting part of chemical space with a lot of active research.

2 PAPERS • 8 BENCHMARKS

SupplyGraph (SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks)

Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bioinformatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problems using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact

1 PAPER • NO BENCHMARKS YET

GEval for KGRC-RDF-star

This repository is an extension of GEval. It contains a software framework for evaluating and comparing RDF-star graph embedding techniques. The gold standard datasets for evaluation were created from KGRC-RDF-star.

1 PAPER • NO BENCHMARKS YET

KGRC-RDF-star

KGRC-RDF-star is an RDF-star dataset converted from KGRC-RDF, which is a knowledge graph dataset of novel stories.

1 PAPER • NO BENCHMARKS YET

LinkedPapersWithCode

An RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications.

1 PAPER • NO BENCHMARKS YET

maze-dataset

This package provides utilities for generation, filtering, solving, visualizing, and processing of mazes for training ML systems. Primarily built for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498

2 PAPERS • NO BENCHMARKS YET

MolGrapher-Synthetic-300K

The set is created using molecule SMILES retrieved from the database PubChem. Images are then generated from SMILES using the molecule drawing library RDKit. The synthetic set is augmented at multiple levels:

1 PAPER • NO BENCHMARKS YET

USPTO-30K

We introduce USPTO-30K, a large-scale benchmark dataset of annotated molecule images, which overcomes these limitations. It is created using the pairs of images and MolFiles by the United States Patent and Trademark Office. Each molecule was independently selected among all the available documents from 2001 to 2020. The set consists of three subsets to decouple the study of clean molecules, molecules with abbreviations and large molecules.

1 PAPER • NO BENCHMARKS YET

Myket Android Application Install

This dataset contains information on application install interactions of users in the Myket Android application market. The dataset was created for the purpose of evaluating interaction prediction models, requiring user and item identifiers along with timestamps of the interactions. Hence, the dataset can be used for interaction prediction and building a recommendation system. Furthermore, the data forms a dynamic network of interactions, so network representation learning can also be performed on the nodes in the network, which are users and applications.

1 PAPER • NO BENCHMARKS YET
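As a rough illustration of how the Myket interaction records could be treated as a dynamic network, the sketch below orders toy (user, app, timestamp) tuples by time, holds out the latest events for prediction, and builds a user-to-app adjacency from the earlier ones. The record layout, the identifiers, and the 75% chronological cut are assumptions made for this example, not the dataset's actual schema or evaluation protocol.

```python
from collections import defaultdict

# Toy interaction records: (user id, app id, timestamp). All values are made up.
interactions = [
    ("u1", "appA", 30),
    ("u2", "appA", 10),
    ("u1", "appB", 20),
    ("u3", "appC", 40),
]

# Order events chronologically so the network can be replayed over time.
events = sorted(interactions, key=lambda record: record[2])

# Assumed chronological split: earliest 75% as history, the rest to predict.
cut = int(0.75 * len(events))
history, future = events[:cut], events[cut:]

# Bipartite adjacency (user -> set of installed apps) built from history only.
adjacency = defaultdict(set)
for user, app, _timestamp in history:
    adjacency[user].add(app)
```

Building the adjacency from the history alone keeps future installs out of the features used to predict them.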

UMVM

We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset MMEA-UMVM.

5 PAPERS • 7 BENCHMARKS

HatefulDiscussions

Multi-Modal Hate Speech Detection with Graph Context.

1 PAPER • NO BENCHMARKS YET

Genre2Movies (Compositional queries for Movie recommendation)

Genre annotations for movies. The file genre2movies.csv contains genre-movie tuples based on Wikidata annotations (https://www.wikidata.org/).

1 PAPER • NO BENCHMARKS YET
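A minimal sketch of how genre-movie tuples might be loaded and combined into a compositional query is shown below. The column names `genre` and `movie` and the sample rows are hypothetical; the actual layout of genre2movies.csv may differ.

```python
import csv
import io
from collections import defaultdict

# Stand-in for genre2movies.csv; the header and rows are invented for illustration.
sample_csv = io.StringIO(
    "genre,movie\n"
    "comedy,Movie A\n"
    "drama,Movie A\n"
    "comedy,Movie B\n"
    "drama,Movie C\n"
)

# Group the tuples into a genre -> set of movies mapping.
genre_to_movies = defaultdict(set)
for row in csv.DictReader(sample_csv):
    genre_to_movies[row["genre"]].add(row["movie"])

# A compositional query ("comedy AND drama") becomes a set intersection.
comedy_and_drama = genre_to_movies["comedy"] & genre_to_movies["drama"]
```

For the real file, `io.StringIO(...)` would be replaced by `open("genre2movies.csv", newline="")`, with the column names adjusted to match the file.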

TTE-A&O (Travel Time Estimation: Abakan and Omsk)

The dataset includes two parts corresponding to the cities of Abakan (65524 nodes, 340012 edges) and Omsk (231688 nodes, 1149492 edges). Along with the road network graph, it includes trip records represented as sequences of visited nodes (making the dataset suitable both for path-blind and path-aware settings). There are two types of target values for a regression task: real travel time and real length of a trip.

2 PAPERS • 1 BENCHMARK
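To make the path-blind versus path-aware distinction concrete, the sketch below walks a trip, given as a sequence of visited nodes, over a toy road graph and accumulates per-edge attributes. The graph, the edge values, and the function name are invented for illustration; in the dataset the regression targets are the recorded travel time and trip length, not sums of edge attributes.

```python
# Toy directed road graph: (from_node, to_node) -> (time in seconds, length in meters).
edges = {
    (0, 1): (30.0, 250.0),
    (1, 2): (45.0, 400.0),
    (2, 3): (20.0, 150.0),
    (0, 3): (120.0, 900.0),
}

def path_totals(path, edges):
    """Sum edge time and length along consecutive node pairs of a trip."""
    total_time = 0.0
    total_length = 0.0
    for u, v in zip(path, path[1:]):
        edge_time, edge_length = edges[(u, v)]
        total_time += edge_time
        total_length += edge_length
    return total_time, total_length

# Path-aware input: the full node sequence of the trip.
trip = [0, 1, 2, 3]
time_s, length_m = path_totals(trip, edges)

# Path-blind input: only the origin and destination are visible.
origin, destination = trip[0], trip[-1]
```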

Large-scale Ridesharing DARP Instances Based on Real Travel Demand

This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

2 PAPERS • NO BENCHMARKS YET

arXivCS

Source

2 PAPERS • 1 BENCHMARK

CIRO experimental results

This repository includes the experiment results, source code, and test data for Three Cs risk inference, using the CIRO (COVID-19 Infection Risk Ontology) and HermiT.

1 PAPER • NO BENCHMARKS YET

data_qe (Federal Reserve Quantitative Easing Data)

This file contains the data and code for the publication "The Federal Reserve's Response to the Global Financial Crisis and Its Long-Term Impact: An Interrupted Time-Series Natural Experimental Analysis" by A. C. Kamkoum, 2023.

1 PAPER • NO BENCHMARKS YET

FireRisk (FireRisk: A Remote Sensing Dataset for Fire Risk Assessment)

In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total of 91,872 labelled images for fire risk assessment. This remote sensing dataset is labelled with the fire risk classes supplied by the Wildfire Hazard Potential (WHP) raster dataset, and remote sensing images are collected using the National Agriculture Imagery Program (NAIP), a high-resolution remote sensing imagery program. On FireRisk, we present benchmark performance for supervised and self-supervised representations, with Masked Autoencoders (MAE) pre-trained on ImageNet1k achieving the highest classification accuracy, 65.29%.

1 PAPER • 1 BENCHMARK

VirtualHome2KG

VirtualHome2KG is a system for constructing and augmenting knowledge graphs (KGs) of daily living activities using virtual space. We also provide an ontology to describe the structure of the KGs. We used VirtualHome as a platform of virtual space simulation. Thus, this repository is an extension of the virtualhome. Please see the original repository of the virtualhome for details of the Unity simulation.

2 PAPERS • NO BENCHMARKS YET

AutoFR Dataset

AutoFR Dataset is broken down by each site that we crawl within a zip file. It contains multiple different experiments that we conducted in our paper. The overall dataset contains 1042 sites that we crawled where we detected ads within the Top-5K.

1 PAPER • NO BENCHMARKS YET

amazon-ratings

amazon-ratings is a product co-purchasing network based on data from SNAP datasets.

13 PAPERS • 1 BENCHMARK

minesweeper

minesweeper is a synthetic graph emulating the eponymous game.

15 PAPERS • 1 BENCHMARK

questions

Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.

21 PAPERS • 1 BENCHMARK

roman-empire

Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.

21 PAPERS • 1 BENCHMARK

tolokers

Tolokers is a network of crowdsourcing platform workers based on data provided by Toloka.

15 PAPERS • 1 BENCHMARK

Argoverse 2 Motion Forecasting

The Argoverse 2 Motion Forecasting Dataset is a curated collection of 250,000 scenarios for training and validation. Each scenario is 11 seconds long and contains the 2D, birds-eye-view centroid and heading of each tracked object sampled at 10 Hz.

15 PAPERS • NO BENCHMARKS YET
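The scenario length and sampling rate above fix the number of samples per tracked object. The short sketch below works through that arithmetic and lays out one placeholder track; the `(x, y, heading)` tuple layout and the assumption of exactly duration times rate samples are illustrative choices, not the dataset's file format.

```python
DURATION_S = 11       # scenario length in seconds (from the description)
SAMPLE_RATE_HZ = 10   # sampling rate in Hz (from the description)

# Samples per fully observed object, assuming duration * rate samples.
num_steps = DURATION_S * SAMPLE_RATE_HZ

# Time offset of each sample from the start of the scenario, in seconds.
timestamps = [step / SAMPLE_RATE_HZ for step in range(num_steps)]

# Placeholder track: one (x, y, heading) tuple per sample, all zeros here.
track = [(0.0, 0.0, 0.0) for _ in range(num_steps)]
```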

ZeroKBC

ZeroKBC is a comprehensive benchmark that covers all scenarios of the zero-shot Knowledge Base Completion (KBC) task. It has 3 zero-shot scenarios with 8 fine-grained settings.

1 PAPER • NO BENCHMARKS YET

RoomEnv-v1 (The Room environment - v1)

The Room environment - v1. For the documentation of RoomEnv-v0, click the corresponding buttons.

1 PAPER • NO BENCHMARKS YET

HiAML

HiAML Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 4.6k CIFAR-10 networks with an accuracy range of [91.11%, 93.44%].

2 PAPERS • NO BENCHMARKS YET

Inception

Inception Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 580 CIFAR-10 networks with an accuracy range of [89.08%, 94.03%].

2 PAPERS • NO BENCHMARKS YET

Two-Path

Two-Path Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 6.9k CIFAR-10 networks with an accuracy range of [85.53%, 92.34%].

2 PAPERS • NO BENCHMARKS YET

BeGin

BeGin provides 23 benchmark scenarios for graphs from 14 real-world datasets, which cover 12 combinations of the incremental settings and the levels of problem. In addition, BeGin provides various basic evaluation metrics for measuring performance and final evaluation metrics designed for continual learning.

1 PAPER • NO BENCHMARKS YET

WyzeRule

Wyze Rule Recommendation Dataset. It is a big dataset with 300,000 users. Please cite [1] if you used the dataset and cite [2] if you referenced the algorithm.

2 PAPERS • NO BENCHMARKS YET

Placenta

Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).

2 PAPERS • 1 BENCHMARK

pmuBAGE

pmuBAGE (the Benchmarking Assortment of Generated PMU Events) is a dataset that consists of almost 1000 instances of labeled event data to encourage benchmark evaluations on phasor measurement unit (PMU) data analytics. PMU data are challenging to obtain, especially those covering event periods. Nevertheless, power system problems have recently seen phenomenal advancements via data-driven machine learning solutions. A highly accessible standard benchmarking dataset would enable a drastic acceleration of the development of successful machine learning techniques in this field.

1 PAPER • NO BENCHMARKS YET

Chameleon (60%/20%/20% random splits)

Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.

15 PAPERS • 1 BENCHMARK
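A 60%/20%/20% random split of node indices, as used by this and the following entries, can be sketched as below. The function name, the seed handling, and the rounding are assumptions for illustration; published benchmarks typically ship fixed split files, which should be preferred for comparable results.

```python
import random

def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node indices and cut them into train/validation/test lists."""
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train_frac * num_nodes)
    n_val = round(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# Example with a hypothetical graph of 1000 nodes.
train_idx, val_idx, test_idx = random_split(1000, seed=42)
```

Results on such benchmarks are usually averaged over several splits, which here would amount to calling the function with different seeds.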

Cornell (60%/20%/20% random splits)

Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.

16 PAPERS • 2 BENCHMARKS

Film (60%/20%/20% random splits)

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

PubMed (60%/20%/20% random splits)

Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Texas (60%/20%/20% random splits)

Node classification on Texas with 60%/20%/20% random splits for training/validation/test.

16 PAPERS • 1 BENCHMARK

Wisconsin (60%/20%/20% random splits)

Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

MarKG (Multimodal analogical reasoning Knowledge Graph)

The MarKG dataset has 11,292 entities, 192 relations and 76,424 images, including 2,063 analogy entities and 27 analogy relations. The original intention of MarKG is to provide prior knowledge of analogy entities and relations for better multimodal analogical reasoning.

2 PAPERS • NO BENCHMARKS YET

doges-dogaresse (Doges and dogaresse of the Venetian Republic)

This is the list of all doges of the Venetian Republic, as well as their wives, where there is a record that they existed. Entries include name, surname if known, and the dates of their time in office, as well as the dates of their weddings. Data has been extracted from Wikipedia, with some errors fixed by checking against other sources.

1 PAPER • NO BENCHMARKS YET
