Urban is one of the most widely used hyperspectral datasets in hyperspectral unmixing research. The image contains 307x307 pixels, each corresponding to a 2x2 m2 area, and 210 wavelengths ranging from 400 nm to 2500 nm, giving a spectral resolution of 10 nm. After channels 1-4, 76, 87, 101-111, 136-153 and 198-210 are removed (due to dense water vapor and atmospheric effects), 162 channels remain; this is a common preprocessing step in hyperspectral unmixing analyses. Three versions of ground truth are provided, containing 4, 5 and 6 endmembers respectively.
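The band-removal step described above can be sketched as a boolean mask over the 210 original channels; the removed 1-based indices are taken from the description, and the cube shape below is a stand-in for the real data.

```python
import numpy as np

# 1-based channel indices removed due to dense water vapor and
# atmospheric effects, as listed in the dataset description.
removed = (list(range(1, 5)) + [76, 87] + list(range(101, 112))
           + list(range(136, 154)) + list(range(198, 211)))

# Boolean mask over the 210 original bands (0-based internally).
keep = np.ones(210, dtype=bool)
keep[np.array(removed) - 1] = False
print(int(keep.sum()))  # 162 channels remain

# Applied to a stand-in cube of the stated shape (307, 307, 210):
cube = np.zeros((307, 307, 210), dtype=np.float32)
cube_clean = cube[:, :, keep]
print(cube_clean.shape)  # (307, 307, 162)
```

The removed indices sum to 48 bands (4 + 1 + 1 + 11 + 18 + 13), which is consistent with the stated 162 remaining channels.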
2 PAPERS • NO BENCHMARKS YET
As part of our policy to openly share all data from this project, we have included a downloadable package comprising all acoustic data collected over the course of this work. This includes acoustic recordings from 20 different species of mosquitoes, using a variety of mobile phones for each. This data can be downloaded from the online repository on dryad.org. The supplementary audio files are not included in this package, and may be downloaded separately.
1 PAPER • NO BENCHMARKS YET
ACL-Fig is a large-scale automatically annotated corpus consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories.
The dataset contains three subsets:
This is a database of backdoored neural networks intended for face recognition. The networks use the FaceNet architecture and are trained on Casia-WebFace, with and without additional samples (which are the source of the backdoor). More information about the backdoors and the project they belong to can be found in the public release of the source code: https://gitlab.idiap.ch/bob/bob.paper.backdoored_facenets.biosig2022.
The dataset contains a total of 253,070 records with 18 features. The features are categorized into four types: Metadata, Primary Data, Engagement Stats, and Label. The Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains the title and description of the video; the "Processed" columns refer to data cleaned through denoising, deduplication, and debiasing for further analysis. The Engagement Stats category contains user engagement metrics for each video. The Label category contains predefined auto labels, human-annotated labels, and AI-generated pseudo labels. Auto labels are derived automatically from a review of titles, descriptions, and thumbnails over time; channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on
The dataset contains 36,000 Bangla text samples based on Ekman's six basic emotions. The data was first introduced in the paper "Alternative non-BERT model choices for the textual classification in low-resource languages and environments". The whole dataset is balanced and evenly distributed among the six classes.
1 PAPER • 1 BENCHMARK
Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the real results of the severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director of the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset of the measurements of both severity (BIRADS) and pathology classifications concerning the patient diagnostic. Work and results are published at AVI 2020, a top Human-Computer Interaction (HCI) conference (page). Results were analyzed and interpreted from our statistical analysis charts. The user tests were conducted in clinical institutions, where clinicians diagnosed several patients for a Single-Modality vs. Multi-Modality comparison. For example, in these t
Original images and images with RUSTICO filters applied
The dataset includes measurements of static tension under a 2 kg load at different points of the CB, as well as measurements under dynamic conditions. The dynamic conditions covered linear belt speeds between nu_1 = 0.5 m/s and nu_max = 1.7 m/s. A unified sampling frequency of 400 Hz was used for the experiments, corresponding to 140 samples.
This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue. All images are 224x224 pixels (px) at 0.5 microns per pixel (MPP). For tissue classification, the classes are: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), colorectal adenocarcinoma epithelium (TUM). The images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
Histological images of colorectal cancer, derived from the TCGA database
CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. A CVE score is often used for prioritizing the security of vulnerabilities.
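Since the entry mentions that CVSS scores are used to prioritize vulnerabilities, the standard CVSS v3.x qualitative severity bands can be sketched as a small helper (the function name is our own, but the score ranges follow the CVSS v3.1 specification):

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print(cvss_severity(9.8))  # Critical
print(cvss_severity(5.0))  # Medium
```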
A dataset of games played in the card game "Cards Against Humanity" (CAH) by human players, derived from the online CAH labs. Each round includes the cards presented to users: a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, along with which punchline was picked by a player each round, plus text and metadata.
A large dataset of color names and their respective RGB values, stored in CSV format.
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. Each pair includes a low-quality and a high-quality version, at a resolution of 128x128 pixels.
The dataset comprises motion sensor data of 19 daily and sports activities each performed by 8 subjects in their own style for 5 minutes. Five Xsens MTx units are used on the torso, arms, and legs.
DeepGraviLens is a data set of simulated gravitational lenses consisting of images associated with brightness variation time series. In this dataset, both non-transient and transient phenomena (supernovae explosions) are simulated.
DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The text covers a broad range of bills from 1986 to the present and contains rich information on parliamentary bill content. There are a total of 5,329 documents, of which 4,223 are in the train set and 1,106 in the test set. Each bill document contains many sentences, and document length varies greatly.
The Dissonance Twitter Dataset is a collection of tweets annotated for dissonance.
FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.
The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (therefore referred to as title), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts have been authored after 2010 and they describe recalls of specific food products due to specific hazards. Experts manually classified each text to four groups of classes describing hazards and products on two levels of granularity:
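The per-language counts quoted above can be checked against the stated total with a quick sanity check (variable names are our own):

```python
# Per-language counts as stated in the dataset description.
counts = {"English": 6644, "German": 888, "French": 8,
          "Greek": 4, "Italian": 1, "Danish": 1}

total = sum(counts.values())
print(total)  # 7546, matching the stated number of texts
```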
FractureAtlas is a musculoskeletal bone fracture dataset with annotations for deep learning tasks like classification, localization, and segmentation. The dataset contains a total of 4,083 X-Ray images with annotation in COCO, VGG, YOLO, and Pascal VOC format. This dataset is made freely available for any purpose. The data provided within this work are free to copy, share or redistribute in any medium or format. The data might be adapted, remixed, transformed, and built upon. The dataset is licensed under a CC-BY 4.0 license. It should be noted that to use the dataset correctly, one needs to have knowledge of medical and radiology fields to understand the results and make conclusions based on the dataset. It's also important to consider the possibility of labeling errors.
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
Gambling Address Dataset is a collection of 10,423 gambling addresses that have transactions with gambling contracts. Moreover, 51,004 non-gambling addresses are also selected (such as exchanges, wallet addresses, etc.), making the gambling address dataset more complete. In the dataset, accounts are used to refer to addresses (e.g. 0xd1ce...edec95), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
Gambling Contract Dataset is a collection of 260 gambling smart contracts from decentralized gambling websites, such as Dicether and Degens. To construct the negative samples required for training, 1,040 smart contracts that are not involved in gambling (e.g., erc20, erc721, mixer, etc.) are also selected. In the dataset, accounts are used to refer to contracts (e.g. 0x3fe2b...f8a33f), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
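The 1/0/-1 label scheme shared by both gambling datasets can be sketched as a tiny encoding helper (the mapping follows the descriptions above; the function and key names are our own):

```python
# 1 = gamble, 0 = non-gamble, -1 = other, per the dataset descriptions.
LABELS = {"gamble": 1, "non-gamble": 0, "other": -1}

def encode(label: str) -> int:
    """Encode a category name into the dataset's integer label."""
    return LABELS[label]

print(encode("gamble"))      # 1
print(encode("non-gamble"))  # 0
print(encode("other"))       # -1
```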
Dataset introduced by Xifeng Yan et al.
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile robots operating in a changing environment (like a household), where it is important to learn new, never-seen objects on the fly. The dataset can also be used for other learning use cases, such as instance segmentation or depth estimation, or wherever household objects or continual learning are of interest.
1 PAPER • 2 BENCHMARKS
The HRPlanesv2 dataset contains 2,120 VHR Google Earth images. To further improve experiment results, images of airports from many different regions with various uses (civil/military/joint) were selected and labeled. A total of 14,335 aircraft have been labeled. Each image is stored as a 4800 x 2703 pixel ".jpg" file, and each label is stored in YOLO ".txt" format. The dataset has been split into three parts: 70% train, 20% validation, and 10% test. The aircraft in the train and validation images are at least 80% visible. Link: https://github.com/dilsadunsal/HRPlanesv2-Data-Set
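Given the 70/20/10 split over 2,120 images, the per-split counts work out as follows (a quick sketch; the repository's actual split files are authoritative):

```python
total = 2120  # VHR Google Earth images

train = int(total * 0.70)   # 1484
val = int(total * 0.20)     # 424
test = total - train - val  # remaining 10%: 212

print(train, val, test)  # 1484 424 212
```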
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel tasks of multimodal figurative understanding and preference.
This dataset was presented as part of the ICLR 2023 paper "A framework for benchmarking Class-out-of-distribution detection and its application to ImageNet".
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data, including labels for escalation, from a fleet of thousands of customers of high-end medical devices.
Context: The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems beyond our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has remained functional since 2014 on a "K2" extended mission.
LEPISZCZE is an open-source comprehensive benchmark for Polish NLP and a continuous-submission leaderboard, concentrating public Polish datasets (existing and new) in specific tasks.
This dataset comprises 22 fundus images with their corresponding manual annotations of the blood vessels, separated into arteries and veins. It also includes glaucomatous/healthy labels, differentiating between normal tension glaucoma (NAG) and primary open angle glaucoma (POAG).
LLeQA is a French native dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
This data set contains 775 video sequences, captured in the wildlife park Lindenthal (Cologne, Germany) as part of the AMMOD project, using an Intel RealSense D435 stereo camera. In addition to color and infrared images, the D435 is able to infer the distance (or "depth") to objects in the scene using stereo vision. Observed animals include various birds (at daytime) and mammals such as deer, goats, sheep, donkeys, and foxes (primarily at nighttime). A subset of 412 images is annotated with a total of 1038 individual animal annotations, including instance masks, bounding boxes, class labels, and corresponding track IDs to identify the same individual over the entire video.
The Mpox Close Skin Images dataset (MCSI) is a collection of skin images obtained from diverse public sources, which we carefully pre-processed (i.e., cropped and zoomed) to focus on the skin lesion (if present) and to evaluate Machine Learning models aimed at detecting different pathologies from skin lesion pictures taken with smartphone cameras. It includes a total of 400 pictures evenly divided into 4 classes: mpox, containing samples of mpox (formerly monkeypox) skin lesions; chickenpox, with samples of chickenpox cases; acne, containing samples of acne at different severity levels; and healthy, containing samples of skin without any evident symptoms. This repository is part of the supplementary material accompanying the paper "A Transfer Learning and Explainable Solution to Detect mpox from Smartphones images".
MapReader in GeoHumanities workshop (SIGSPATIAL 2022): Gold standards and outputs
MiST (Modals In Scientific Text) is a dataset containing 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
The MixedWM38 dataset (WaferMap) contains more than 38,000 wafer maps, covering 1 normal pattern, 8 single-defect patterns, and 29 mixed-defect patterns, for a total of 38 patterns.
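The pattern count quoted above is simple arithmetic over the three groups (variable names are our own):

```python
# Pattern counts as stated in the dataset description.
normal = 1    # defect-free pattern
single = 8    # single-defect patterns
mixed = 29    # mixed-defect patterns

total = normal + single + mixed
print(total)  # 38 patterns in total
```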
This dataset can be used by anyone interested in the morphological classification of galaxies. It was originally provided by Kaggle user Jay Lin (https://www.kaggle.com/jay1985) and was used in the conference paper "Morphological Classification of Galaxies Using SpinalNet".
Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. One of the major stumbling blocks for manual retinal examination is the lack of a sufficient number of qualified medical personnel per capita to diagnose diseases. Computer-aided diagnosis (CAD) systems have proven to be very effective in helping physicians reduce the time taken to make a diagnosis and minimize variability in image interpretation. Still, they are not flexible enough to accommodate the simultaneous presence of multiple retinal diseases, which is a common situation in real-world applications. In past years, a few datasets focusing on the classification of numerous retinal pathologies present at the same time (i.e., multi-label classification) have been proposed, but they share some problems, such as a narrow range of pathologies to classify, a high level of class imbalance, and a low amount of samples for the underrepresented
Neural fields (NeFs) have recently emerged as a versatile method for modeling signals of various modalities, including images, shapes, and scenes. Subsequently, many works have explored the use of NeFs as representations for downstream tasks, e.g. classifying an image based on the parameters of a NeF that has been fit to it. However, the impact of the NeF hyperparameters on their quality as downstream representation is scarcely understood and remains largely unexplored. This is partly caused by the large amount of time required to fit datasets of neural fields.
Hand-labelled dataset of crop and non-crop labels distributed throughout Nigeria, with corresponding HDF5 data arrays.
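Reading one such labelled HDF5 array with h5py might look like the sketch below; the key names (`bands`, `is_crop`), shapes, and file layout are assumptions for illustration, and the real dataset's structure may differ, so a small stand-in file is written first to make the read pattern runnable.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path and keys -- the actual layout of the
# dataset's HDF5 files may differ.
path = os.path.join(tempfile.mkdtemp(), "patch.h5")

# Write a small stand-in file so the read pattern below is runnable.
with h5py.File(path, "w") as f:
    f.create_dataset("bands", data=np.zeros((12, 12, 4), dtype=np.float32))
    f.attrs["is_crop"] = 1

# Typical read pattern for one labelled patch.
with h5py.File(path, "r") as f:
    bands = f["bands"][...]       # load the full array into memory
    label = int(f.attrs["is_crop"])

print(bands.shape, label)  # (12, 12, 4) 1
```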
Onchocerciasis is causing blindness in over half a million people in the world today. Drug development for the disease is crippled because there is no way of measuring a drug's effectiveness without an invasive procedure. Measuring drug efficacy through assessment of the viability of Onchocerca worms requires patients to undergo nodulectomy, an invasive, expensive, time-consuming, skill- and infrastructure-dependent, and lengthy process.
Monitoring and evaluating driving behavior is the main goal of this paper, which encouraged us to develop a new system based on the Inertial Measurement Unit (IMU) sensors of smartphones. In this system, a hybrid of Discrete Wavelet Transformation (DWT) and an Adaptive Neuro-Fuzzy Inference System (ANFIS) is used to recognize overall driving behaviors. The behaviors are classified into safe, semi-aggressive, and aggressive classes, aligned with Driver Anger Scale (DAS) self-reported questionnaire results. The proposed system extracts four features from IMU sensors as time series. These are decomposed by DWT into two levels, and their energies are sent to six ANFISs. Each ANFIS models a different perception of driving behavior under uncertain knowledge and returns the similarity or dissimilarity between driving behaviors. The results of the six ANFISs are combined by three different decision fusion approaches. Results show that Coiflet-2 is the most suitable
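The energy-feature step described above (two DWT levels, then band energies) can be sketched with a dependency-free Haar DWT; the paper itself reports the Coiflet-2 wavelet as best, so this is an illustration of the pipeline's shape under a simpler wavelet, not the exact transform. Because the Haar transform here is orthonormal, the band energies sum to the signal's total energy.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                      # pad to even length
        x = np.append(x, x[-1])
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def dwt_energies(x, levels=2):
    """Energies of each detail band plus the final approximation band."""
    energies = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        energies.append(float(np.sum(d ** 2)))
    energies.append(float(np.sum(a ** 2)))
    return energies

# Stand-in for one IMU feature time series.
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
print(dwt_energies(signal, levels=2))
```

In the paper's pipeline, energies like these (from a two-level Coiflet-2 decomposition of each of the four IMU features) would be the inputs fed to the six ANFIS models.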