126 dataset results for Classification

Niramai Oncho Dataset (Niramai Onchocerciasis/RiverBlindness Dataset)

Onchocerciasis causes blindness in over half a million people worldwide. Drug development for the disease is crippled because there is no way to measure a drug's effectiveness without an invasive procedure. Measuring drug efficacy by assessing the viability of Onchocerca worms requires patients to undergo nodulectomy, which is invasive, expensive, lengthy, and dependent on skill and infrastructure.

1 PAPER • NO BENCHMARKS YET

Overall-Driving-Behavior-Recognition-By-Smartphone

Overall-Driving-Behavior-Recognition-By-Smartphone (Hamid Reza Eftekhari)

Monitoring and evaluating driving behavior is the main goal of this paper, which motivates a new system based on the Inertial Measurement Unit (IMU) sensors of smartphones. In this system, a hybrid of Discrete Wavelet Transformation (DWT) and Adaptive Neuro Fuzzy Inference System (ANFIS) is used to recognize overall driving behaviors. The behaviors are classified into safe, semi-aggressive, and aggressive classes, which are aligned with Driver Anger Scale (DAS) self-reported questionnaire results. The proposed system extracts four features from the IMU sensors in the form of time series. They are decomposed by DWT into two levels, and their energies are sent to six ANFISs. Each ANFIS models a different perception of driving behavior under uncertain knowledge and returns the similarity or dissimilarity between driving behaviors. The results of these six ANFISs are combined by three different decision fusion approaches. Results show that Coiflet-2 is the most suitable

1 PAPER • 1 BENCHMARK

RGZ EMU: Semantic Taxonomy

RGZ EMU: Semantic Taxonomy (Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy)

The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted) and "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.

1 PAPER • NO BENCHMARKS YET

Raw-Microscopy and Raw-Drone

Raw-Microscopy:

1 PAPER • NO BENCHMARKS YET

Reddit Ideology Database

A dataset of articles posted in the r/Liberal and r/Conservative subreddits, with a corpus of 226,010 articles in total. The news articles were collected to understand political expression through shared news.

1 PAPER • 1 BENCHMARK

Regensburg Pediatric Appendicitis Dataset

This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. Multiple abdominal B-mode ultrasound images were acquired for most patients, with the number of views varying from 1 to 15. The images depict various regions of interest, such as the abdomen’s right lower quadrant, appendix, intestines, lymph nodes and reproductive organs. Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). The study was approved by the Ethics Committee of the University of Regensburg (

1 PAPER • NO BENCHMARKS YET

SF-MASK (Small Face MASK)

SF-MASK is a collection made from 20k low-resolution images exported from diverse and heterogeneous datasets, ranging from 7 x 7 to 64 x 64 pixel resolution. An accurate visualization of this collection, through counting grids, made it possible to highlight gaps in the variety of poses assumed by the heads of the pedestrians.

1 PAPER • NO BENCHMARKS YET

SHADR

SHADR (synthetic SDoH Human Annotated Demographic Robustness dataset (SHADR))

SDoH Human Annotated Demographic Robustness (SHADR) Dataset Overview: Social determinants of health (SDoH) play a pivotal role in determining patient outcomes. However, their documentation in electronic health records (EHR) remains incomplete. This dataset was created from a study examining the capability of large language models to extract SDoH from the free-text sections of EHRs. The study also examined the potential of synthetic clinical text to bolster the extraction of these scarcely documented, yet crucial, clinical data.

1 PAPER • NO BENCHMARKS YET

SHD - Adding (Spiking Heidelberg Digits - Adding)

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Each sample input consists of two spike-encoded digits, sampled uniformly at random from the SHD dataset and concatenated, with the target being the sum of the digits (irrespective of language). The train and test splits remain the same, with the test set consisting of 16k such samples based on the SHD test set.
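The sample construction described above can be sketched as follows. This is a minimal illustration and not the dataset authors' code: the `(spike_times, unit_ids, label)` tuple layout, the helper name `make_adding_sample`, the label convention (digit value = label % 10), and concatenation by shifting the second digit in time are all assumptions.

```python
import random

def make_adding_sample(dataset, rng):
    """Draw two digits uniformly at random and join them into one sample.

    `dataset` is a list of (spike_times, unit_ids, label) tuples. This layout
    and the label convention (digit value = label % 10) are assumptions.
    """
    t1, u1, y1 = rng.choice(dataset)
    t2, u2, y2 = rng.choice(dataset)
    # Shift the second digit so it starts after the first one ends
    # (one possible reading of "concatenated").
    offset = max(t1) if t1 else 0.0
    times = list(t1) + [t + offset for t in t2]
    units = list(u1) + list(u2)
    # The target ignores the spoken language and keeps only the digit values.
    target = (y1 % 10) + (y2 % 10)
    return times, units, target

# Toy stand-in for SHD samples (hypothetical values, not real recordings).
toy = [
    ([0.1, 0.4], [3, 7], 2),
    ([0.2, 0.3, 0.5], [1, 4, 9], 17),
]
times, units, target = make_adding_sample(toy, random.Random(0))
print(len(times), target)
```

Under these assumptions the target always lies between 0 and 18, so the task can be framed either as 19-way classification or as regression.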

1 PAPER • 1 BENCHMARK

STEDUCOV: A DATASET ON STANCE DETECTION IN TWEETS TOWARDS ONLINE EDUCATION DURING COVID-19 PANDEMIC

StEduCov is a dataset annotated for stances toward online education during the COVID-19 pandemic. StEduCov has 17,097 tweets gathered over 15 months, from March 2020 to May 2021, using the Twitter API. The tweets are manually annotated into agree, disagree, or neutral classes. We used a set of relevant hashtags and keywords. Specifically, we utilised a combination of hashtags, such as '#COVID 19' or '#Coronavirus', with keywords, such as 'education', 'online learning', 'distance learning' and 'remote learning'. To ensure high annotation quality, three different annotators annotated each tweet, and at least one of three judges reviewed it. They were guided by instructions, for example that a tweet in the disagree class should contain a clear negative statement about online education or its impact. Also, if the tweet is negative but refers to other people (e.g. 'my children hate online learning').

1 PAPER • 1 BENCHMARK

Satellite

The Satellite dataset forms a practical VFL scenario for location identification based on satellite imagery. Each AOI, with its unique location identifier, is captured by 16 satellite visits. Assuming each visit is carried out by a distinct satellite organization, these organizations aim to collectively train a model to classify the land type of the location without sharing original images. The Satellite dataset encompasses four land types as labels, namely Amnesty POI (4.8%), ASMSpotter (8.9%), Landcover (61.3%), and UNHCR (25.0%), making the task a 4-class classification problem of 3927 locations, containing 62,832 images across 16 parties, simulating a practical VFL scenario of collaborative location identification via multiple satellites.
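The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers from the description; the variable names are illustrative, and the per-class counts are approximations derived from the rounded percentages, not official figures.

```python
N_LOCATIONS = 3927   # locations (AOIs), from the description
N_PARTIES = 16       # satellite visits, one per party

# Label shares in percent, as quoted in the description.
CLASS_SHARE = {
    "Amnesty POI": 4.8,
    "ASMSpotter": 8.9,
    "Landcover": 61.3,
    "UNHCR": 25.0,
}

# One image per location per party.
total_images = N_LOCATIONS * N_PARTIES

# Approximate locations per class (the source gives only rounded percentages).
approx_counts = {
    name: round(N_LOCATIONS * pct / 100) for name, pct in CLASS_SHARE.items()
}

print(total_images)
print(approx_counts)
```

The product of 3927 locations and 16 parties matches the 62,832 images stated in the description, and the four shares sum to 100%.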

1 PAPER • NO BENCHMARKS YET

SciHTC

SciHTC is a dataset for hierarchical multi-label text classification (HMLTC) of scientific papers which contains 186,160 papers and 1,233 categories from the ACM CCS tree.

1 PAPER • NO BENCHMARKS YET

Simulated micro-Doppler Signatures

Simulated pulse Doppler radar signatures for four classes of helicopter-like targets. The classes differ in the number of rotating blades each kind of target carries, so each class translates into a specific modulation pattern on the Doppler signature. Doppler signatures are a typical feature used for radar target discrimination. This dataset was generated using a simple open-source MATLAB simulation code, which can be easily modified to generate custom datasets with more classes and increased intra-class diversity.

1 PAPER • NO BENCHMARKS YET

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708

1 PAPER • 1 BENCHMARK

SupplyGraph (SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks)

Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bioinformatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for GNN methodologies. This opens up possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback for this approach is the absence of real-world benchmark datasets to facilitate research on solving supply chain problems with GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact

1 PAPER • NO BENCHMARKS YET

Tinto (Tinto: Multisensor Benchmark for 3D Hyperspectral Point Cloud Segmentation in the Geosciences)

The increasing use of deep learning techniques has reduced interpretation time and, ideally, reduced interpreter bias by automatically deriving geological maps from digital outcrop models. However, accurate validation of these automated mapping approaches is a significant challenge due to the subjective nature of geological mapping and the difficulty in collecting quantitative validation data. Additionally, many state-of-the-art deep learning methods are limited to 2D image data, which is insufficient for 3D digital outcrops, such as hyperclouds. To address these challenges, we present Tinto, a multi-sensor benchmark digital outcrop dataset designed to facilitate the development and validation of deep learning approaches for geological mapping, especially for non-structured 3D data like point clouds. Tinto comprises two complementary sets: 1) a real digital outcrop model from Corta Atalaya (Spain), with spectral attributes and ground-truth data, and 2) a synthetic twin that uses latent

1 PAPER • NO BENCHMARKS YET

Two Coiling Spirals

Two Coiling Spirals is a 2D classification dataset composed of two classes; each spiral corresponds to one class.
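To illustrate what a two-class spiral dataset of this kind looks like, here is a minimal generic generator. It is not the generator used for Two Coiling Spirals; the function name `two_spirals`, its parameters, and the Archimedean spiral shape are assumptions chosen for illustration.

```python
import math
import random

def two_spirals(n_per_class=100, turns=2.0, noise=0.0, seed=0):
    """Generate 2D points on two interleaved spirals, one class per spiral.

    Hypothetical generator: the second spiral is the first one rotated by
    180 degrees, with optional Gaussian noise added to each coordinate.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_class):
        frac = i / max(n_per_class - 1, 1)   # position along the spiral, 0..1
        theta = frac * turns * 2 * math.pi   # angle grows with position
        radius = frac                        # radius grows linearly
        for label, phase in ((0, 0.0), (1, math.pi)):
            x = radius * math.cos(theta + phase) + rng.gauss(0, noise)
            y = radius * math.sin(theta + phase) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

points, labels = two_spirals(n_per_class=50, noise=0.02)
print(len(points), labels.count(0), labels.count(1))
```

A linear classifier cannot separate the two classes in such data, which is why spiral datasets are a common test for nonlinear models.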

1 PAPER • NO BENCHMARKS YET

Vulnerable Verified Smart Contracts

Vulnerable Verified Smart Contracts is a dataset of real vulnerable Ethereum smart contracts, based on the manually labeled Benchmark dataset of Solidity smart contracts. A total of 609 vulnerable contracts are provided, containing 1,117 vulnerabilities.

1 PAPER • NO BENCHMARKS YET

WHYSHIFT

In our benchmark WHYSHIFT, we explore distribution shifts on 5 real-world tabular datasets from the economic and traffic sectors with natural spatiotemporal distribution shifts. We pick only 7 typical settings out of 22 and select only one representative target domain for each setting. In our benchmark, we specify the distribution shift pattern for each setting, and we provide tools to identify risky regions with large $Y|X$ shifts and to diagnose performance degradation.

1 PAPER • NO BENCHMARKS YET

WINGBEATS (MOSQUITO WINGBEAT RECORDINGS)

Context: The database contains WAV recordings from the same optical sensor, inserted in turn into six insectary boxes, each containing only one mosquito species of both sexes (about 200-300 flying mosquitoes in each cage). As the mosquitoes fly randomly through the sensor, their wingbeat partially occludes the light from the transmitter to the receiver. The recorded light fluctuation is modulated by the wingbeat of the insect. The resulting signal is pseudo-acoustic, meaning that it sounds exactly like a microphone recording but has been acquired using optical means (however, not vision based). Insect Biometrics, in the context of our work, is a measurable behavioral characteristic of flying insects. Biometric identifiers are related to the shape of the body (main body size, wing shape, wingbeat frequency, pattern movement of the wings). Biometric identification methods use biometric characteristics or traits to verify species/sex identities when insects access endpoint traps following a bait.

1 PAPER • NO BENCHMARKS YET

ALFI (Annotations for Label-Free Images)

ALFI (Annotations for Label-Free Images) is a dataset of images and annotations for label-free microscopy imaging. It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community through figshare.

0 PAPER • NO BENCHMARKS YET

ALTA 2022 Shared Task

ALTA 2022 Shared Task (PIBOSO Sentence classification)

This dataset is described in the ALTA 2022 Shared Task and associated CodaLab competition.

0 PAPER • NO BENCHMARKS YET

ALTA 2023 Shared Task

ALTA 2023 Shared Task (Discriminate between human-authored and synthetic text generated by Large Language Models (LLMs))

This dataset is described in the ALTA 2023 Shared Task and associated CodaLab competition.

0 PAPER • NO BENCHMARKS YET

Big-Five Backstage

The dataset consists of 3265 text samples, each the concatenation of lines spoken by a fictional character. Texts are extracted from 400 theatre plays written by 132 different authors. In total, it contains 3,419,136 words, with a mean of 1047.2 words per character. Text entries have binary labels representing the gender of a character (Male or Female) and their five personality traits (Extraversion, Agreeableness, Openness, Neuroticism, Conscientiousness). The auxiliary part of the dataset includes author-level labels reflecting their gender, country of origin, and years of life.

0 PAPER • NO BENCHMARKS YET

Compositional Visual Reasoning (CVR)

A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality -- allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluidic intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abs

0 PAPER • NO BENCHMARKS YET

I-CARE: International Cardiac Arrest REsearch consortium Database

The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalogram (EEG) and electrocardiogram (ECG) recordings from comatose patients following cardiac arrest. The patients were admitted to an intensive care unit (ICU) in one of seven academic hospitals in the U.S. and Europe and monitored for several hours to several days. The long-term neurological function of the patients was determined using the Cerebral Performance Category scale.

0 PAPER • NO BENCHMARKS YET

Mudestreda (Mudestreda Multimodal Device State Recognition Dataset)

Mudestreda is a multimodal device state recognition dataset obtained from a real industrial milling device, with time series and image data for classification, regression, anomaly detection, Remaining Useful Life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.

0 PAPER • NO BENCHMARKS YET

OCT5k

The thickness and appearance of retinal layers are essential markers for diagnosing and studying eye diseases. Despite the increasing availability of imaging devices to scan and store large amounts of data, analyzing retinal images and generating trial endpoints has remained a manual, error-prone, and time-consuming task. In particular, the lack of large amounts of high-quality labels for different diseases hinders the development of automated algorithms. Therefore, we have compiled 5016 pixel-wise manual labels for 1672 optical coherence tomography (OCT) scans featuring two different diseases as well as healthy subjects to help democratize the process of developing novel automatic techniques. We also collected 4698 bounding box annotations for a subset of 566 scans across 9 classes of disease biomarker. Due to variations in retinal morphology, intensity range, and changes in contrast and brightness, designing segmentation and detection methods that can generalize to different disease

0 PAPER • NO BENCHMARKS YET

PINet

We propose a new light field image database called “PINet”, which inherits the hierarchical structure of WordNet. It consists of 7549 light field images (LIs) captured by Lytro Illum, which is much larger than existing databases. The images are manually annotated into 178 categories according to WordNet, such as cat, camel, bottle, fans, etc. The registered depth maps are also provided. Each image is generated by processing the raw LI from the camera with Light Field Toolbox v0.4 for demosaicing and devignetting.

0 PAPER • NO BENCHMARKS YET

Semeion

Semeion (Semeion Handwritten Digit Data Set)

1593 handwritten digits from around 80 persons were scanned and stretched into a 16x16 rectangular box in a gray scale of 256 values.

0 PAPER • NO BENCHMARKS YET
