🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language (clear)

69 dataset results for Classification AND English

The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.

593 PAPERS • 13 BENCHMARKS

CIFAR-10C

Common corruptions dataset for CIFAR10

414 PAPERS • NO BENCHMARKS YET

SGD

SGD (Schema-Guided Dialogue)

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, user simulation learning, among other tasks in large-scale virtual assistants. Besides these, the dataset has unseen domains and services in the evaluation set to quantify the performance in zero-shot or few shot settings.

170 PAPERS • 3 BENCHMARKS

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

47 PAPERS • 2 BENCHMARKS

InDL (In-Diagram Logic)

Dataset Introduction

11 PAPERS • 1 BENCHMARK

N-ImageNet (Large-Scale Dataset for Event-Based Object Recognition)

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera around a monitor displaying images from ImageNet. N-ImageNet contains approximately 1,300k training samples and 50k validation samples. In addition, the dataset also contains variants of the validation dataset recorded under a wide range of lighting or camera trajectories. Additional details about the dataset are explained in the paper available through this link. Please cite this paper if you make use of the dataset.

11 PAPERS • 3 BENCHMARKS

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and at 48,000 samples/second for drive end bearing experiments. All fan end bearing data was collected at 12,000 samples/second.

9 PAPERS • 1 BENCHMARK

SSC (Spiking Speech Commands v0.2)

The SSC dataset is a spiking version of the Speech Commands dataset release by Google (Speech Commands). SSC was generated using Lauscher, an artificial cochlea model. The SSC dataset consists of utterances recorded from a larger number of speakers under controlled conditions. Spikes were generated in 700 input channels, and it contains 35 word categories from a large number of speakers.

6 PAPERS • 1 BENCHMARK

XImageNet-12 (XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation)

Enlarge the dataset to understand how image background effect the Computer Vision ML model. With the following topics: Blur Background / Segmented Background / AI generated Background/ Bias of tools during annotation/ Color in Background / Dependent Factor in Background/ LatenSpace Distance of Foreground/ Random Background with Real Environment!

5 PAPERS • 1 BENCHMARK

ALTA 2021 Shared Task

ALTA 2021 Shared Task (Automatic Grading of Evidence, 10 years later)

This dataset is described in the ALTA 2021 Shared Task website and associated CodaLab competition.

4 PAPERS • NO BENCHMARKS YET

ArtiFact (Artificial and Factual Image Dataset for Synthetic Image Detection)

The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.

4 PAPERS • NO BENCHMARKS YET

DiaMOS Plant (A Dataset for Diagnosis and Monitoring Plant Disease)

Abstract The classification and recognition of foliar diseases is an increasingly developing field of research, where the concepts of machine and deep learning are used to support agricultural stakeholders. Datasets are the fuel for the development of these technologies. In this paper, we release and make publicly available the field dataset collected to diagnose and monitor plant symptoms, called DiaMOS Plant, consisting of 3505 images of pear fruit and leaves affected by four diseases. In addition, we perform a comparative analysis of existing literature datasets designed for the classification and recognition of leaf diseases, highlighting the main features that maximize the value and information content of the collected data. This study provides guidelines that will be useful to the research community in the context of the selection and construction of datasets.

4 PAPERS • NO BENCHMARKS YET

RITE (Retinal Images vessel Tree Extraction)

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries and veins on retinal fundus images, which is established based on the public available DRIVE database (Digital Retinal Images for Vessel Extraction).

4 PAPERS • 2 BENCHMARKS

AnthroProtect

For a detailed description, we refer to Section 3 in our research article.

3 PAPERS • NO BENCHMARKS YET

Brain Tumor Dataset

This brain tumor dataset contains 3064 T1-weighted contrast-enhanced images with three kinds of brain tumor. Detailed information on the dataset can be found in the readme file.

3 PAPERS • NO BENCHMARKS YET

MS-FIMU

MS-FIMU (Mobility Scenario FIMU)

Open Dataset: Mobility Scenario FIMU

3 PAPERS • NO BENCHMARKS YET

Tasksource

Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.

3 PAPERS • NO BENCHMARKS YET

Xhate999

We present XHate-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaption, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in the zero-shot transfer setups.

3 PAPERS • NO BENCHMARKS YET

ADHD-200

Attention Deficit Hyperactivity Disorder (ADHD) affects at least 5-10% of school-age children and is associated with substantial lifelong impairment, with annual direct costs exceeding $36 billion/year in the US. Despite a voluminous empirical literature, the scientific community remains without a comprehensive model of the pathophysiology of ADHD. Further, the clinical community remains without objective biological tools capable of informing the diagnosis of ADHD for an individual or guiding clinicians in their decision-making regarding treatment.

2 PAPERS • NO BENCHMARKS YET

CWD30 (Crop Weed Dataset 30 species)

CWD30 comprises over 219,770 high-resolution images of 20 weed species and 10 crop species, encompassing various growth stages, multiple viewing angles, and environmental conditions. The images were collected from diverse agricultural fields across different geographic locations and seasons, ensuring a representative dataset.

2 PAPERS • NO BENCHMARKS YET

Deep PCB (Deep Printed Circuit Board)

DeepPCB

2 PAPERS • 1 BENCHMARK

DeepHS Fruit v2

The data set covers recordings of ripening fruit with labels of destructive measurements (fruit flesh firmness, sugar content and overall ripeness). The labels are provided within three categories (firmness, sweetness and overall ripeness). Four measurement series were performed. Besides 1018 labeled recordings, the data set contains 4671 recordings without ripeness label.

2 PAPERS • NO BENCHMARKS YET

Dermatology ddx dataset

The dermatology differential diagnoses (ddx) dataset for skin condition classification includes expert annotations and model predictions for 1947 cases. Note that no images or meta information are provided. The expert annotations come in the form of differential diagnoses, i.e., partial rankings of conditions, and there is a high level of disagreement among experts, making this a perfect benchmark for dealing with disagreement. The data has been introduced in [1] and [2].

2 PAPERS • NO BENCHMARKS YET

Fashion-MNIST-H

We provide multiple human annotations for each test image in Fashion-MNIST. This can be used as soft labels or probabilistic labels instead of the usual hard (single) labels.

2 PAPERS • NO BENCHMARKS YET

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel tasks of multimodal figurative understanding and preference.

2 PAPERS • 2 BENCHMARKS

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are helpful to the reader when searching for information and contextualizing specific topics. The goal of this work is to segment the sections of clinical medical domain documentation. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from the MIMIC-III. We include several baselines, source code, a pretrained model and analysis of the data showing a relationship between medical concepts across sections using principal component analysis.

2 PAPERS • 2 BENCHMARKS

Oracle-MNIST

Oracle-MNIST (Oracle-MNIST: a Realistic Image Dataset for Benchmarking Machine Learning Algorithms)

We introduce the Oracle-MNIST dataset, comprising of 2828 grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern classification, with particular challenges on image noise and distortion. The training set totally consists of 27,222 images, and the test set contains 300 images per class. Oracle-MNIST shares the same data format with the original MNIST dataset, allowing for direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from 1) extremely serious and unique noises caused by three-thousand years of burial and aging and 2) dramatically variant writing styles by ancient Chinese, which all make them realistic for machine learning research. The dataset is freely available at https://github.com/wm-bupt/oracle-mnist.

2 PAPERS • NO BENCHMARKS YET

AjwaOrMedjool

AjwaOrMedjool (AjwaOrMedjool: a binary balanced dataset to teach machine learning‏)

The dataset contains three subsets:

1 PAPER • NO BENCHMARKS YET

BFN

BFN (Backdoored Face-Networks Dataset)

This database is a database of backdoored neural networks intended for face recognition. The networks are of the FaceNet architecture and are trained on Casia-WebFace, with and without additional samples (which are the source of the backdoor). More information regarding backdoors and the project within which this fits can be found in the public release of the source code : https://gitlab.idiap.ch/bob/bob.paper.backdoored_facenets.biosig2022.

1 PAPER • NO BENCHMARKS YET

BreastClassifications4 ([MIMBCD-UI] UTA4: Severity & Pathology Classifications Dataset)

Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the real results severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director from the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset for the measurements of both severity (BIRADS) and pathology classifications concerning the patient diagnostic. Work and results are published on a top Human-Computer Interaction (HCI) conference named AVI 2020 (page). Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were made in clinical institutions, where clinicians diagnose several patients for a Single-Modality vs Multi-Modality comparison. For example, in these t

1 PAPER • NO BENCHMARKS YET

CORBEL (Conveyor belt pressure signal dataset))

Dataset included measuring static tension under 2 kg load in different points of the CB and measurements in dynamic conditions. The latter conditions presumed the range of the linear belt speeds between nu_1 = 0.5 and nu_max = 1.7 m/s. 400 Hz unified sampling frequency for the experiments. It corresponded with 140 samples.

1 PAPER • 1 BENCHMARK

CRCDX

CRCDX (TCGA-CRC-DX)

Histological images of colorectal cancer, derived from the TCGA database

1 PAPER • NO BENCHMARKS YET

CVE (Common Vulnerabilities and Exposures)

CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. A CVE score is often used for prioritizing the security of vulnerabilities.

1 PAPER • NO BENCHMARKS YET

Cards Against Humanity

A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.

1 PAPER • NO BENCHMARKS YET

Colors

A large dataset of color names and their respective RGB values stores in CSV.

1 PAPER • 1 BENCHMARK

DIGITal (Digitally Generated Numerals)

Digitally Generated Numerals (DIGITal) Description The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. These image pairs include both low and high-quality versions, with a resolution of 128x128 pixels.

1 PAPER • NO BENCHMARKS YET

Daily and Sports Activities

The dataset comprises motion sensor data of 19 daily and sports activities each performed by 8 subjects in their own style for 5 minutes. Five Xsens MTx units are used on the torso, arms, and legs.

1 PAPER • NO BENCHMARKS YET

DeepGraviLens

DeepGraviLens is a data set of simulated gravitational lenses consisting of images associated with brightness variation time series. In this dataset, both non-transient and transient phenomena (supernovae explosions) are simulated.

1 PAPER • NO BENCHMARKS YET

DeepParliament

DeepParliament is a legal domain Benchmark Dataset that gathers bill documents and metadata and performs various bill status classification tasks. The dataset text covers a broad range of bills from 1986 to the present and contains richer information on parliament bill content. There are a total of 5329 documents where 4223 are in the train and 1106 are in the test dataset. Each bill document contains many sentences in both cases, and the document’s length varies greatly.

1 PAPER • NO BENCHMARKS YET

Dissonance Twitter Dataset

Dissonance Twitter Dataset is a dataset collected from annotating tweets for dissonance.

1 PAPER • NO BENCHMARKS YET

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

1 PAPER • NO BENCHMARKS YET

Food Recall Incidents Dataset

The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (therefore referred to as title), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts have been authored after 2010 and they describe recalls of specific food products due to specific hazards. Experts manually classified each text to four groups of classes describing hazards and products on two levels of granularity:

1 PAPER • NO BENCHMARKS YET

FracAtlas (A Dataset for Fracture Classification, Localization and Segmentation of Musculoskeletal Radiographs)

FractureAtlas is a musculoskeletal bone fracture dataset with annotations for deep learning tasks like classification, localization, and segmentation. The dataset contains a total of 4,083 X-Ray images with annotation in COCO, VGG, YOLO, and Pascal VOC format. This dataset is made freely available for any purpose. The data provided within this work are free to copy, share or redistribute in any medium or format. The data might be adapted, remixed, transformed, and built upon. The dataset is licensed under a CC-BY 4.0 license. It should be noted that to use the dataset correctly, one needs to have knowledge of medical and radiology fields to understand the results and make conclusions based on the dataset. It's also important to consider the possibility of labeling errors.

1 PAPER • NO BENCHMARKS YET

Graph dataset MCF-7

Graph dataset MCF-7 (MCF-7)

Dataset introduced by Xifeng Yan et al.

1 PAPER • NO BENCHMARKS YET

Graph dataset MOLT-4

Graph dataset MOLT-4 (MOLT-4)

Dataset introduced by Xifeng Yan et al.

1 PAPER • NO BENCHMARKS YET

HOWS (HOWS-CL-25)

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile robots operating in a changing environment (like a household), where it is important to learn new, never seen objects on the fly. This dataset can also be used for other learning use-cases, like instance segmentation or depth estimation. Or where household objects or continual learning are of interest.

1 PAPER • 2 BENCHMARKS

HRPlanesV2

HRPlanesV2 (HRPlanesv2 - High Resolution Satellite Imagery for Aircraft Detection)

The HRPlanesv2 dataset contains 2120 VHR Google Earth images. To further improve experiment results, images of airports from many different regions with various uses (civil/military/joint) selected and labeled. A total of 14,335 aircrafts have been labelled. Each image is stored as a ".jpg" file of size 4800 x 2703 pixels and each label is stored as YOLO ".txt" format. Dataset has been split in three parts as 70% train, %20 validation and test. The aircrafts in the images in the train and validation datasets have a percentage of 80 or more in size. Link: https://github.com/dilsadunsal/HRPlanesv2-Data-Set

1 PAPER • NO BENCHMARKS YET

Industrial Benchmark Dataset for Customer Escalation Prediction

This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.

1 PAPER • NO BENCHMARKS YET