The Social Media Mining for Health (SMM4H) Shared Task provides a large data source for biomedical and public health applications.
41 PAPERS • NO BENCHMARKS YET
The Twitter emoji dataset obtained from CodaLab comprises 50 thousand tweets along with their associated emoji labels. Each tweet in the dataset has a corresponding numerical label that maps to a specific emoji. The emojis are the 20 most frequent ones, so the labels range from 0 to 19.
4 PAPERS • NO BENCHMARKS YET
The COVID-19 Posteroanterior Chest X-Ray fused (CPCXR) dataset is generated by fusing three publicly available datasets: COVID-19 cxr image, Radiological Society of North America (RSNA), and the Montgomery County collection of the U.S. National Library of Medicine (USNLM), NLM(MC). The dataset consists of samples labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. It can be used to train and evaluate deep learning and machine learning models on binary and multi-class classification problems.
1 PAPER • NO BENCHMARKS YET
CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies publicly known vulnerabilities. Each vulnerability is analyzed and then rated with the Common Vulnerability Scoring System (CVSS) to evaluate its threat level. The resulting score is often used to prioritize which vulnerabilities to address first.
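As a rough illustration of how a CVSS score can drive prioritization, the sketch below maps a base score to the qualitative severity bands of the CVSS v3.x rating scale (None, Low, Medium, High, Critical) and sorts a few findings by score. The function name `cvss_v3_severity` and the sample records are hypothetical and are not taken from this dataset.

```python
def cvss_v3_severity(score: float) -> str:
    """Map a CVSS v3.x base score (0.0 to 10.0) to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score must be between 0.0 and 10.0, got {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1 to 3.9
    if score < 7.0:
        return "Medium"    # 4.0 to 6.9
    if score < 9.0:
        return "High"      # 7.0 to 8.9
    return "Critical"      # 9.0 to 10.0


# Hypothetical findings: (identifier, CVSS base score). Illustrative values only.
findings = [("EXAMPLE-1", 5.3), ("EXAMPLE-2", 9.8), ("EXAMPLE-3", 2.1)]

# Highest score first, so the most severe findings are handled first.
prioritized = sorted(findings, key=lambda item: item[1], reverse=True)

for identifier, score in prioritized:
    print(identifier, score, cvss_v3_severity(score))
```

Sorting by the numeric score rather than by the band keeps the ordering stable within a band, which is usually what a triage queue needs.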
DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The dataset text covers a broad range of bills from 1986 to the present and contains richer information on parliament bill content. There are 5329 documents in total, of which 4223 are in the training set and 1106 are in the test set. In both sets, each bill document contains many sentences, and document length varies greatly.
Two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible.
The ROAD dataset is made up of observations from the Low Frequency Array (LOFAR) telescope. LOFAR comprises 52 stations across Europe, where each station is an array of 96 dual-polarisation low-band antennas (LBA) in the 10–90 MHz range and 48 or 96 dual-polarisation high-band antennas (HBA) in the 110–250 MHz range. The data are four-dimensional, with the dimensions corresponding to time, frequency, polarisation, and station. [...] dictate the array configuration (i.e. the number of stations used), the number of frequency channels (Nf), the time sampling, as well as the overall integration time (Nt) of the observing session. Furthermore, the dual polarisation of the antennas results in a correlation product (Npol) of size 4. The ROAD dataset contains ten classes that describe various system-wide phenomena and anomalies from data obtained by the LOFAR telescope. These classes are categorised into four groups: data processing system failures, electronic anomalies, environmental
SF-MASK is a collection made from 20k low-resolution images exported from diverse and heterogeneous datasets, ranging from 7 x 7 to 64 x 64 pixel resolution. An accurate visualization of this collection, through counting grids, made it possible to highlight gaps in the variety of poses assumed by the heads of the pedestrians.
SmokEng is a dataset of 3144 tweets, selected based on the presence of colloquial slang related to smoking and analyzed based on the semantics of each tweet.
The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of Intrusion Detection Systems (IDS). It presents a dual structure: one part provides a tabular representation of extracted features in CSV format, while the other offers raw network traffic data for each type of traffic in PCAP files. This rich dataset captures both benign and malicious network scenarios, serving as an invaluable resource for researchers in the machine learning field.
1 PAPER • 3 BENCHMARKS
This dataset comprises the dynamic analysis reports generated by CAPEv2 from both malware and goodware. We source the goodware following Dambra et al. (https://arxiv.org/abs/2307.14657), who build a dataset spanning 2012 to 2020 from the community-maintained packages of Chocolatey. The malware is sourced from VirusTotal, namely the Portable Executable samples from 2017 to 2020 that it releases for academic purposes. In total, the dataset we assembled contains 26,200 PE samples: 8,600 (33%) goodware and 17,675 (67%) malware.
0 PAPER • NO BENCHMARKS YET
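Eduge news classification dataset provided by Bolorsoft LLC. It was used to train the Eduge.mn production news classifier and contains 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education), and байгал орчин (environment).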