A labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign).
37 PAPERS • NO BENCHMARKS YET
Useful for through two applications - automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total).
18 PAPERS • NO BENCHMARKS YET
NCBI Disease Corpus is a large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus. The corpus contains rich annotations, was developed by a team of 12 annotators (two people per annotation) and covers all sentences in a PubMed abstract. Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories.
10 PAPERS • NO BENCHMARKS YET
A novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users.
7 PAPERS • NO BENCHMARKS YET