The GDELT Project is a remarkable initiative that monitors our world by analyzing global news from various sources. Here are the key aspects of the GDELT dataset:
3 PAPERS • 1 BENCHMARK
The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of the Corpus has been annotated by three annotators by providing the following layers of annotations, each one characterizing a core aspect of scientific publications:
2 PAPERS • 2 BENCHMARKS
The Kinships dataset describes relationships between members of the Australian tribe Alyawarra and consists of 10,686 triples. It contains 104 entities representing members of the tribe and 26 relationship types that represent kinship terms such as Adiadya or Umbaidya.
2 PAPERS • NO BENCHMARKS YET
The Nations dataset is a small knowledge graph with 14 entities, 55 relations, and 1992 triples describing countries and their political relationships. This dataset is available for download from https://github.com/ZhenfengLei/KGDatasets.
The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated with argument components and argumentative relations.
1 PAPER • 2 BENCHMARKS
The Aristo Tuple KB contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints. The dataset was introduced by the paper Domain-Targeted, High Precision Knowledge Extraction.
1 PAPER • 1 BENCHMARK
CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. A CVE score is often used for prioritizing the security of vulnerabilities.
1 PAPER • NO BENCHMARKS YET
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as ancestor-descendant prediction task proposed in the paper.
Hypertention Disease Medication dataset.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed using User-Generated Content data collected from Flickr social media platform in three global cities containing UNESCO World Heritage property (Amsterdam, Suzhou, Venice). The motivation of data collection in this project is to provide datasets that could be both directly applicable for ML communities as test-bed, and theoretically informative for heritage and urban scholars to draw conclusions on for planning decision-making.
This dataset contains information on application install interactions of users in the Myket android application market. The dataset was created for the purpose of evaluating interaction prediction models, requiring user and item identifiers along with timestamps of the interactions. Hence, the dataset can be used for interaction prediction and building a recommendation system. Furthermore, the data forms a dynamic network of interactions, and we can also perform network representation learning on the nodes in the network, which are users and applications.
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
This dataset gathers 14,857 entities, 133 relations, and entities corresponding tokenized text from PubMed. It contains 875,698 training pairs, 109,462 development pairs, and 109,462 test pairs.
SINS is a database of continuous real-life audio recordings in a home environment. The home is a vacation home and one person lived there during the recording period of over on week. It was collected using a network of 13 microphone arrays distributed over the multiple rooms. Each microphone array consisted of 4 linearly arranged microphones. Recordings were annotated based on the level of daily activities performed in the environment.