Twitter-HyDrug is a real-world hypergraph dataset describing drug trafficking communities on Twitter. We first crawled the metadata (275,884,694 posts from 40,780,721 users) through the official Twitter API from December 2020 to August 2021. We then generated a drug keyword list covering 21 drug types that may cause overdose or addiction problems, and used it to filter tweets containing drug-relevant information; this yielded 266,975 drug-relevant posts by 54,680 users. Based on drug function, we define six types of drug communities: cannabis, opioid, hallucinogen, stimulant, depressant, and others. Six researchers spent 62 days annotating these Twitter users into the six communities according to the annotation rules discussed in the next section; following these criteria, each researcher annotated the filtered metadata separately. For Twitter users with disagreeing labels, we conducted further discussion among the annotators.
1 PAPER • 1 BENCHMARK
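The keyword-based filtering step described above can be sketched as follows. The keywords and post format here are placeholders (the actual 21-drug-type list is not given in the entry); the point is the word-boundary matching that keeps only drug-relevant posts.

```python
import re

# Hypothetical keywords standing in for the real 21-drug-type list.
DRUG_KEYWORDS = ["oxycodone", "fentanyl", "xanax"]

# One compiled pattern with word boundaries, so keywords embedded
# inside unrelated words are not counted as matches.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, DRUG_KEYWORDS)) + r")\b", re.IGNORECASE
)

def filter_drug_posts(posts):
    """Keep only posts whose text mentions at least one keyword."""
    return [p for p in posts if PATTERN.search(p["text"])]

posts = [
    {"user": "a", "text": "Selling Xanax bars, DM me"},
    {"user": "b", "text": "Lovely weather today"},
]
filtered = filter_drug_posts(posts)
```

In the real pipeline this filter reduced roughly 276 million crawled posts to the 266,975 drug-relevant posts mentioned above.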
FDCompCN is a fraud detection dataset for detecting financial statement fraud of companies in China. We construct a multi-relation graph based on the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, covering 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing stock exchanges.
The analysis of building models for usable area, building safety, and energy efficiency requires accurate classification data of spaces and space elements. To reduce input model preparation effort and errors, automated classification of spaces and space elements is desirable. Although existing space function classifiers use space adjacency or connectivity graphs as input, the application of Graph Deep Learning (GDL) to space layout element classification has not been extensively researched due to the lack of suitable datasets. To bridge this gap, we introduce SAGC-A68, a dataset comprising access graphs automatically generated from 68 digital 3D models of space layouts of apartment buildings designed or built between 1952 and 2019 in 13 countries. Each access graph contains nodes representing spaces and space elements, and edges representing the connections between them. Nodes are uniquely identified and characterized by 16 features, including "Position X", "Position Y", and "Position Z".
1 PAPER • NO BENCHMARKS YET
amazon-ratings is a product co-purchasing network based on data from the SNAP datasets.
13 PAPERS • 1 BENCHMARK
minesweeper is a synthetic graph emulating the eponymous game.
15 PAPERS • 1 BENCHMARK
Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.
20 PAPERS • 1 BENCHMARK
Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.
21 PAPERS • 1 BENCHMARK
Tolokers is a crowdsourcing platform workers network based on data provided by Toloka.
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64 dimensions covering 11 cell types), class labels are imbalanced (9 classes ranging from 0.21% to 40.0% of the data), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 to 44,671 nodes for a single structure).
2 PAPERS • 1 BENCHMARK
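One standard remedy for imbalance as severe as Placenta's (classes spanning 0.21% to 40.0% of nodes) is inverse-frequency class weighting in the loss. This is a generic sketch, not the dataset's documented protocol:

```python
import numpy as np

def inverse_freq_weights(labels, num_classes):
    """Weight each class by the inverse of its frequency,
    normalized so the mean weight is 1. Rare classes then
    contribute as much to the loss as common ones."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * np.maximum(counts, 1))

labels = np.array([0] * 8 + [1] * 2)  # toy 2-class imbalance, 80%/20%
w = inverse_freq_weights(labels, 2)
```

Such weights can be passed directly to a weighted cross-entropy loss in most deep learning frameworks.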
Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
16 PAPERS • 2 BENCHMARKS
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
16 PAPERS • 1 BENCHMARK
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.
17 PAPERS • 1 BENCHMARK
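The 60%/20%/20% random splits used by the entries above can be sketched as boolean node masks. The exact seeds and permutation scheme used by any particular benchmark are an assumption here; only the proportions come from the entries:

```python
import numpy as np

def random_split(num_nodes, train=0.6, val=0.2, seed=0):
    """Return disjoint boolean train/val/test masks over the nodes."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train = int(train * num_nodes)
    n_val = int(val * num_nodes)
    masks = {k: np.zeros(num_nodes, dtype=bool) for k in ("train", "val", "test")}
    masks["train"][perm[:n_train]] = True
    masks["val"][perm[n_train:n_train + n_val]] = True
    masks["test"][perm[n_train + n_val:]] = True
    return masks

masks = random_split(183)  # e.g. 183 nodes
```

Benchmarks typically average results over several such splits drawn with different seeds.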
The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance on a given task. The 5 datasets in this benchmark can be used to prototype new models that capture long-range dependencies in graphs.
41 PAPERS • 5 BENCHMARKS
Classifying all cells in an organ is a relevant and difficult problem in plant developmental biology. Here we abstract this problem into a new benchmark for node classification in a geo-referenced graph; solving it requires learning the spatial layout of the organ, including its symmetries. To allow convenient testing of new geometric learning methods, the benchmark of Arabidopsis thaliana ovules is made available as a PyTorch data loader, along with a large number of precomputed features.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) built from User-Generated Content collected from the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The motivation behind this data collection is to provide datasets that are both directly applicable as test-beds for ML communities and theoretically informative for heritage and urban scholars to draw conclusions from in planning decision-making.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each semantically linked to one of 13 thousand fact-checked claims across dozens of topics, events, and domains, in 41 different languages, covering more than a decade.
4 PAPERS • 3 BENCHMARKS
This is the large version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the small version of the MuMiN dataset.
Node classification on Penn94
46 PAPERS • 2 BENCHMARKS
Node classification on genius
35 PAPERS • 2 BENCHMARKS
Node classification on twitch-gamers
23 PAPERS • 2 BENCHMARKS
The wiki dataset consists of Wikipedia articles, where the goal is to predict the total page views of each article. Nodes: 1,925,342; edges: 303,434,860; features: 600; classes: 5.
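Turning a continuous quantity such as page views into the 5 classes mentioned above is commonly done by quantile binning, which yields balanced ordinal classes. This is an assumed construction for illustration, not the documented one for wiki:

```python
import numpy as np

def views_to_classes(views, num_classes=5):
    """Bucket raw page-view counts into roughly balanced ordinal
    classes using quantile binning (an assumption, not the
    dataset's documented labeling rule)."""
    views = np.asarray(views, dtype=float)
    # Class boundaries at the 20th, 40th, 60th, and 80th percentiles.
    edges = np.quantile(views, np.linspace(0, 1, num_classes + 1)[1:-1])
    return np.searchsorted(edges, views, side="right")

labels = views_to_classes(
    [10, 50, 200, 1000, 7000, 30000, 90000, 250000, 1e6, 5e6]
)
```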
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
31 PAPERS • 3 BENCHMARKS
Aesthetic Visual Analysis is a dataset for aesthetic image assessment that contains over 250,000 images along with a rich variety of meta-data including a large number of aesthetic scores for each image, semantic labels for over 60 categories as well as labels related to photographic style.
11 PAPERS • 3 BENCHMARKS
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
6 PAPERS • 2 BENCHMARKS
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
10 PAPERS • 2 BENCHMARKS
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
73 PAPERS • 2 BENCHMARKS
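The Wiki-CS feature construction described above (averaging pretrained GloVe embeddings over an article's text) can be sketched as follows. The toy embedding table is a stand-in for the real 300-dimensional GloVe vectors:

```python
import numpy as np

DIM = 300  # Wiki-CS node features are 300-dimensional
rng = np.random.default_rng(0)
# Random vectors standing in for pretrained GloVe embeddings.
glove = {w: rng.normal(size=DIM) for w in ["graph", "neural", "network"]}

def article_feature(tokens):
    """Average the embeddings of in-vocabulary tokens;
    fall back to a zero vector if none are known."""
    vecs = [glove[t] for t in tokens if t in glove]
    if not vecs:
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)

feat = article_feature(["graph", "neural", "network", "oov_word"])
```

Out-of-vocabulary tokens are simply skipped, so every node still receives a fixed-length feature vector.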
MAG-Scholar-C is constructed by Bojchevski et al. based on the Microsoft Academic Graph (MAG), in which nodes are papers, edges represent citation relations among papers, and node features are bag-of-words representations of paper abstracts.
4 PAPERS • NO BENCHMARKS YET
CLUSTER is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social networks by modulating the intra- and inter-community connections, thereby controlling the difficulty of the task. CLUSTER aims at identifying community clusters in a semi-supervised setting.
134 PAPERS • 1 BENCHMARK
PATTERN is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social networks by modulating the intra- and inter-community connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.
121 PAPERS • 1 BENCHMARK
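The Stochastic Block Model underlying CLUSTER and PATTERN can be sampled in a few lines. Block sizes and probabilities below are illustrative, not the benchmarks' actual generation parameters:

```python
import numpy as np

def sbm(sizes, p_in, p_out, seed=0):
    """Sample an undirected Stochastic Block Model: nodes in the
    same block connect with probability p_in, nodes in different
    blocks with probability p_out. Shrinking the gap between
    p_in and p_out makes the communities harder to recover."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int), labels

adj, labels = sbm([40, 40, 40], p_in=0.5, p_out=0.05)
```

The block labels serve directly as node-classification targets.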
Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.
18 PAPERS • 2 BENCHMARKS
Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.
14 PAPERS • 2 BENCHMARKS
Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.
17 PAPERS • 2 BENCHMARKS
Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.
15 PAPERS • 2 BENCHMARKS
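Unlike the random-split entries earlier in this list, the Geom-GCN entries above evaluate every model on identical, pre-computed 48%/32%/20% masks. A sketch of consuming and sanity-checking such masks follows; the `.npz` file layout and key names are assumptions, not Geom-GCN's actual distribution format:

```python
import numpy as np

def load_fixed_split(path):
    """Load one pre-computed split. The keys 'train_mask',
    'val_mask', 'test_mask' are assumed for illustration."""
    with np.load(path) as f:
        return f["train_mask"], f["val_mask"], f["test_mask"]

def check_split(train, val, test):
    """Fixed masks must be disjoint and cover every node."""
    total = train.astype(int) + val.astype(int) + test.astype(int)
    assert (total == 1).all(), "masks overlap or miss nodes"

# Toy 10-node masks in place of a real split file.
train = np.array([1] * 5 + [0] * 5, dtype=bool)
val = np.array([0] * 5 + [1] * 3 + [0] * 2, dtype=bool)
test = np.array([0] * 8 + [1] * 2, dtype=bool)
check_split(train, val, test)
```

Fixing the masks removes split-sampling variance, so reported numbers are directly comparable across papers.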
This webgraph is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages, while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. The graph was collected through the Facebook Graph API in November 2017 and restricted to pages from four categories defined by Facebook: politicians, governmental organizations, television shows, and companies. The task associated with this dataset is multi-class node classification over these four categories.
7 PAPERS • NO BENCHMARKS YET
The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles, and squirrels). Nodes represent articles and edges are mutual links between them. The edges CSV files contain the edges; nodes are indexed from 0. The features JSON files contain the features of articles: each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target CSV contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we list the number of nodes and edges, along with other descriptive statistics.
158 PAPERS • 2 BENCHMARKS
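Loading the file layout described above (an edges CSV with 0-indexed nodes plus a features JSON keyed by page id) can be sketched with the standard library. The CSV header name and the toy contents are assumptions; only the layout comes from the description:

```python
import csv
import io
import json

# Minimal stand-ins for the dataset's edges CSV and features JSON.
edges_csv = "id1,id2\n0,1\n1,2\n0,2\n"
features_json = '{"0": [3, 17], "1": [17], "2": [3]}'

# Skip the header row, then parse each edge as a pair of node ids.
rows = list(csv.reader(io.StringIO(edges_csv)))[1:]
edges = [(int(a), int(b)) for a, b in rows]
# Each feature id marks an informative noun present in the article.
features = {int(k): set(v) for k, v in json.loads(features_json).items()}

# Build an undirected adjacency list from the mutual links.
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
```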
The data was collected from the music streaming service Deezer (November 2017). These datasets represent friendship networks of users from three European countries: Romania, Croatia, and Hungary. Nodes represent the users and edges are the mutual friendships. We reindexed the nodes to achieve a certain level of anonymity; the CSV files contain the edges, with nodes indexed from 0. The JSON files contain the genre preferences of users: each key is a user id, and the liked genres are given as lists. Genre notation is consistent across users, and in each dataset users could like 84 distinct genres. Liked-genre lists were compiled from the liked-song lists. For each dataset we list the number of nodes and edges.
3 PAPERS • NO BENCHMARKS YET
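Since each Deezer user's target is a list of liked genres drawn from a shared vocabulary of 84, the natural training target is a multi-hot vector per user. A sketch, assuming genre ids run from 0 to 83:

```python
import numpy as np

NUM_GENRES = 84  # genre notation is consistent across users

def genres_to_multihot(user_genres, num_genres=NUM_GENRES):
    """Turn each user's liked-genre list into a fixed-length
    multi-hot vector, one row per user."""
    y = np.zeros((len(user_genres), num_genres), dtype=int)
    for row, genres in enumerate(user_genres.values()):
        y[row, genres] = 1
    return y

users = {"0": [2, 5], "1": [5], "2": []}
targets = genres_to_multihot(users)
```

This framing makes the task multi-label classification, typically trained with a per-genre binary loss.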
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:
94 PAPERS • 7 BENCHMARKS
This dataset classifies protein roles (in terms of their cellular functions from gene ontology) in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue [41]. Positional gene sets, motif gene sets, and immunological signatures are used as features, and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database [34]. The average graph contains 2,373 nodes, with an average degree of 28.8.
285 PAPERS • 2 BENCHMARKS
Brazil Air-Traffic is a network of Brazilian airports in which edges indicate the existence of commercial flights between airports.
8 PAPERS • 2 BENCHMARKS
Introduced in: Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning Node Representations from Structural Identity.
9 PAPERS • 1 BENCHMARK