This relational database consists of 24 unique names in two families (they have equivalent structures).
10 PAPERS • NO BENCHMARKS YET
The Dataset is part of the KELM corpus
10 PAPERS • 1 BENCHMARK
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
10 PAPERS • 2 BENCHMARKS
Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to Astro Physics category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.
9 PAPERS • 2 BENCHMARKS
Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
9 PAPERS • 1 BENCHMARK
The SARDet-100K dataset encompasses a total of 116,598 images, and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. SARDet100K dataset stands as the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). The scale and diversity of the SARDet-100K dataset provide researchers with robust training and evaluation for advancing SAR object detection algorithms and techniques, fostering the development of SOTA models in this domain.
Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity.
4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors 1 mounted to the ceiling of the OR, with one frame-per-second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
8 PAPERS • 1 BENCHMARK
Brazil Air-Traffic
8 PAPERS • 2 BENCHMARKS
Regression dataset for molecular docking scores (predicted molecule-protein binding affinity). Contains ~250,000 molecules against 58 protein targets.
8 PAPERS • NO BENCHMARKS YET
The Ecoli dataset is a dataset for protein localization. It contains 336 E.coli proteins split into 8 different classes.
7 PAPERS • NO BENCHMARKS YET
This webgraph is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies. The task related to this dataset is multi-class node classification for the 4 site categories.
We collected data about Facebook pages (November 2017). These datasets represent blue verified Facebook page networks of different categories. Nodes represent the pages and edges are mutual likes among them. We reindexed the nodes in order to achieve a certain level of anonimity. The csv files contain the edges -- nodes are indexed from 0. We included 8 different distinct types of pages. These are listed below. For each dataset we listed the number of nodes an edges.
GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and text-to-knowledge graph (T2G) conversion. It is introduced in the paper "GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation" by Zhijing Jin, Qipeng Guo, Xipeng Qiu, and Zheng Zhang at COLING 2020.
7 PAPERS • 2 BENCHMARKS
New3, a set of 527 instances from AMR 3.0, whose original source was the LORELEI DARPA project – not included in the AMR 2.0 training set – consisting of excerpts from newswires and online forum.
For benchmarking, please refer to its variant UPFD-POL and UPFD-GOS.
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
6 PAPERS • 2 BENCHMARKS
Random sampled instances of the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) for 20, 50 and 100 customer nodes.
6 PAPERS • NO BENCHMARKS YET
Mindboggle is a large publicly available dataset of manually labeled brain MRI. It consists of 101 subjects collected from different sites, with cortical meshes varying from 102K to 185K vertices. Each brain surface contains 25 or 31 manually labeled parcels.
1.0 Version of OpenEA benchmark datasets. Please use the updated 2.0 version, that has been subsequently released.
6 PAPERS • 4 BENCHMARKS
RARE consists of English AMR pairs with similarity scores that reflect the structural differences between them.
5 PAPERS • 1 BENCHMARK
SSN (short for Semantic Scholar Network) is a scientific papers summarization dataset which contains 141K research papers in different domains and 661K citation relationships. The entire dataset constitutes a large connected citation graph.
5 PAPERS • NO BENCHMARKS YET
This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. In particular, it focuses on small instances which prove to be challenging for one or more state-of-the-art TSP algorithms. These instances are based on difficult instances of Hamiltonian cycle problem (HCP). This includes instances from literature, specially modified randomly generated instances, and instances arising from the conversion of other difficult problems to HCP.
We present a further analysis of visual modality incompleteness, benchmarking latest MMEA models on our proposed dataset MMEA-UMVM.
5 PAPERS • 7 BENCHMARKS
The Vent dataset is a large annotated dataset of text, emotions, and social connections. It comprises more than 33 millions of posts by nearly a million of users together with their social connections. Each post has an associated emotion. There are 705 different emotions, organized in 63 "emotion categories", forming a two-level taxonomy of affects.
The ZS-F-VQA dataset is a new split of the F-VQA dataset for zero-shot problem. Firstly we obtain the original train/test split of F-VQA dataset and combine them together to filter out the triples whose answers appear in top-500 according to its occurrence frequency. Next, we randomly divide this set of answers into new training split (a.k.a. seen) $\mathcal{A}_s$ and testing split (a.k.a. unseen) $\mathcal{A}_u$ at the ratio of 1:1. With reference to F-VQA standard dataset, the division process is repeated 5 times. For each $(i,q,a)$ triplet in original F-VQA dataset, it is divided into training set if $a \in \mathcal{A}_s$. Else it is divided into testing set. The overlap of answer instance between training and testing set in F-VQA are $2565$ compared to $0$ in ZS-F-VQA.
The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and are divided into three classes (Database, Wireless Communication, Data Mining). An heterogeneous graph is constructed, which comprises 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to elements of a bag-of-words represented of keywords.
4 PAPERS • 1 BENCHMARK
Chickenpox Cases in Hungary is a spatio-temporal dataset of weekly chickenpox (childhood disease) cases from Hungary. It can be used as a longitudinal dataset for benchmarking the predictive performance of spatiotemporal graph neural network architectures. The dataset consists of a county-level adjacency matrix and time series of the county-level reported cases between 2005 and 2015. There are 2 specific related tasks:
4 PAPERS • NO BENCHMARKS YET
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment task. DPB-5L (Greek) is a subset of DPB-5L with Greek KG.
Graph Robustness Benchmark (GRB) provides scalable, unified, modular, and reproducible evaluation on the adversarial robustness of graph machine learning models. GRB has elaborated datasets, unified evaluation pipeline, modular coding framework, and reproducible leaderboards, which facilitate the developments of graph adversarial learning, summarizing existing progress and generating insights into future research.
The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.
InferWiki is a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns. First, each testing sample is predictable with supportive data in the training set. Second, InferWiki initiates the evaluation following the open-world assumption and improves the inferential difficulty of the closed-world assumption, by providing manually annotated negative and unknown triples. Third, the dataset includes various inference patterns (e.g., reasoning path length and types) for comprehensive evaluation.
KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering .
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
This is the set of instances use in the PACE 2018 competition, of optimal Steiner Tree computation. The instances are grouped into three tracks of 200 instances each, except for the third track which is only 199 instances. Each instance is an undirected graph.
ProteinKG25 is a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and proteins entities. ProteinKG25 contains 4,990,097 triplets (4,879,951 Protein-GO triplets and 110,146 GO-GO triplets), 612,483 entities (565,254 proteins and 47,229 GO terms) and 31 relations.
Amazon Fine Foods is a dataset that consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review.
3 PAPERS • NO BENCHMARKS YET
Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to General Relativity and Quantum Cosmology category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
3 PAPERS • 2 BENCHMARKS
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment task. DPB-5L (English) is a subset of DPB-5L with English KG.
3 PAPERS • 1 BENCHMARK
How and where proteins interface with one another can ultimately impact the proteins' functions along with a range of other biological processes. As such, precise computational methods for protein interface prediction (PIP) come highly sought after as they could yield significant advances in drug discovery and design as well as protein function analysis. However, the traditional benchmark dataset for this task, Docking Benchmark 5 (DB5), contains only a paltry 230 complexes for training, validating, and testing different machine learning algorithms. In this work, we expand on a dataset recently introduced for this task, the Database of Interacting Protein Structures (DIPS), to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces. The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex, whereas DIPS-Plus now includes a plethora of new residue-level
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment task. DPB-5L (French) is a subset of DPB-5L with French KG.
The data was collected from the music streaming service Deezer (November 2017). These datasets represent friendship networks of users from 3 European countries. Nodes represent the users and edges are the mutual friendships. We reindexed the nodes in order to achieve a certain level of anonimity. The csv files contain the edges -- nodes are indexed from 0. The json files contain the genre preferences of users -- each key is a user id, the genres loved are given as lists. Genre notations are consistent across users. In each dataset users could like 84 distinct genres. Liked genre lists were compiled based on the liked song lists. The countries included are Romania, Croatia and Hungary. For each dataset we listed the number of nodes an edges.
The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.
This is a catalogue and repository of network datasets with the aim of aiding scientific research.
The OQMD is a database of DFT calculated thermodynamic and structural properties of one million materials, created in Chris Wolverton's group at Northwestern University.
3 PAPERS • 4 BENCHMARKS
Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits , coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control systems (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows to explore a number of research questions on the long tail of public software development, instead of solely focusing on ''most starred'' repositories as it often happens.
This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were inspired by the UNL project to include this novel, so that different groups could compare representations on the same text.