34 dataset results for Audio Classification

AudioSet

AudioSet is an audio event dataset consisting of over 2M human-annotated 10-second video clips. The clips are collected from YouTube, so many are of poor quality and contain multiple sound sources. A hierarchical ontology of 632 event classes is used to annotate the data, which means that the same sound can carry several labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. The videos are split into Evaluation, Balanced-Train, and Unbalanced-Train sets.
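The effect of a hierarchical ontology on labelling can be sketched in a few lines. The example below is a minimal illustration only: the `PARENT` mapping is a hypothetical three-node slice built from the barking example in the text, and `expand_labels` is an illustrative helper, not part of any AudioSet tooling.

```python
# Hypothetical slice of a hierarchical ontology, stored as child -> parent.
# Node names come from the barking example; the structure is illustrative only.
PARENT = {
    "Dog": "Pets",
    "Pets": "Animal",
    "Animal": None,  # root of this toy hierarchy
}


def expand_labels(label, parent=PARENT):
    """Return the label followed by all of its ancestors, most specific first."""
    labels = []
    node = label
    while node is not None:
        labels.append(node)
        node = parent[node]
    return labels


# A clip tagged with the most specific class also carries every ancestor class.
bark_labels = expand_labels("Dog")
print(bark_labels)  # ['Dog', 'Pets', 'Animal']
```

Under this view a single barking clip is a positive example for three classes at once, which is why classification on such data is usually treated as a multi-label problem.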

586 PAPERS • 6 BENCHMARKS

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

342 PAPERS • 4 BENCHMARKS

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2000 five-second clips in 50 classes spanning natural, human, and domestic sounds, drawn from Freesound.org.

296 PAPERS • 6 BENCHMARKS

VGG-Sound

Consists of more than 210k videos for 310 audio classes.

150 PAPERS • 3 BENCHMARKS

EPIC-KITCHENS-100

This paper introduces the pipeline used to scale EPIC-KITCHENS, the largest dataset in egocentric vision. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, and 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments using head-mounted cameras. Compared to its previous version (EPIC-KITCHENS-55), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete (128% more action segments) annotations of fine-grained actions. This collection also enables evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected under the same hypotheses albeit "two years on". The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), and unsupervised domain adaptation for action recognition.

134 PAPERS • 7 BENCHMARKS

UrbanSound8K

Urban Sound 8K is an audio dataset that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
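For classification experiments, the ten class names are typically mapped to integer targets. The sketch below shows one way to do that. It assumes indices simply follow the order in which the classes are listed above; the dataset's own metadata may number them differently, and `encode` and `decode` are illustrative helpers rather than part of any official loader.

```python
# Class names as listed in the dataset description.
CLASSES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

# Assumption: integer ids follow list order (illustrative, not the official mapping).
CLASS_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}
INDEX_TO_CLASS = {i: name for name, i in CLASS_TO_INDEX.items()}


def encode(names):
    """Convert class names to integer targets."""
    return [CLASS_TO_INDEX[name] for name in names]


def decode(indices):
    """Convert integer targets back to class names."""
    return [INDEX_TO_CLASS[i] for i in indices]


print(encode(["siren", "dog_bark"]))  # [8, 3]
print(decode([0, 9]))                 # ['air_conditioner', 'street_music']
```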

122 PAPERS • 1 BENCHMARK

FSD50K (Freesound Database 50K)

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra. It consists mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more.

116 PAPERS • 2 BENCHMARKS

UCR Time Series Classification Archive

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a

32 PAPERS • 2 BENCHMARKS

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. The clips come from 48 male and 43 female actors between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).

20 PAPERS • 7 BENCHMARKS

DiCOVA

The DiCOVA Challenge dataset is derived from the Coswara dataset, a crowd-sourced dataset of sound recordings from COVID-19 positive and non-COVID-19 individuals. The Coswara data is collected using a web application, launched in April 2020 and accessible over the internet to anyone around the globe. Volunteering subjects are advised to record their respiratory sounds in a quiet environment.

19 PAPERS • 1 BENCHMARK

FSDnoisy18k

The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set of FSDnoisy18k consists of 15,813 audio clips (38.8h), and the test set consists of 947 audio clips (1.4h) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label is part of the target class set. Analogously, OOV means that the true or missing label is not covered by those 20 classes.
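The distinction between the two noise types can be expressed as a small decision rule. The sketch below is a simplified single-label illustration: `TARGET_CLASSES` is a hypothetical three-class stand-in for the 20-class vocabulary (the actual class names are not listed here), and `label_noise_type` is an illustrative helper, not part of the dataset's tooling.

```python
# Hypothetical stand-in for the 20-class target vocabulary.
TARGET_CLASSES = {"Rain", "Clapping", "Bass_guitar"}


def label_noise_type(observed_label, true_label, vocabulary=TARGET_CLASSES):
    """Classify the label noise of one clip (single-label simplification).

    "clean": the observed label matches the true label.
    "IV":    the observed label is wrong and the true label is in the vocabulary.
    "OOV":   the observed label is wrong and the true label is outside the vocabulary.
    """
    if observed_label == true_label:
        return "clean"
    return "IV" if true_label in vocabulary else "OOV"


print(label_noise_type("Rain", "Rain"))      # clean
print(label_noise_type("Rain", "Clapping"))  # IV
print(label_noise_type("Rain", "Thunder"))   # OOV
```

The same rule extends to incomplete labels by applying it to each missing label in turn.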

18 PAPERS • NO BENCHMARKS YET

RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.

18 PAPERS • 6 BENCHMARKS

SHD (Spiking Heidelberg Digits)

The Spiking Heidelberg Digits (SHD) dataset is an audio-based classification dataset of roughly 10k spoken digits ranging from zero to nine in the English and German languages. The audio waveforms have been converted into spike trains using an artificial model of the inner ear and parts of the ascending auditory pathway. The SHD dataset has 8,156 training and 2,264 test samples. A full description of the dataset and how it was created can be found in the paper below. Please cite this paper if you make use of the dataset.

15 PAPERS • 1 BENCHMARK

ICBHI Respiratory Sound Database (The Respiratory Sound database - ICBHI 2017 Challenge)

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics - ICBHI 2017.

11 PAPERS • 1 BENCHMARK

VocalSound

VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. The VocalSound dataset also contains meta-information such as speaker age, gender, native language, country, and health condition.

10 PAPERS • 1 BENCHMARK

YouTube-100M (YouTube-100m)

The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and 20M validation videos. Videos average 4.6 minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more topic identifiers from a set of 30,871 labels. There are an average of around 5 labels per video. The labels are assigned automatically based on a combination of metadata (title, description, comments, etc.), context, and image content for each video. The labels apply to the entire video and range from very generic (e.g. “Song”) to very specific (e.g. “Cormorant”). Being machine generated, the labels are not 100% accurate and of the 30K labels, some are clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are often labeled “Entertainment” as well, although no hierarchy is enforced.
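The headline figures can be cross-checked with simple arithmetic. The snippet below is only a sanity check on the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the dataset description.
train_videos = 70_000_000
eval_videos = 10_000_000
validation_videos = 20_000_000
mean_minutes_per_video = 4.6

# The three splits should add up to the 100 million videos in the dataset name.
total_videos = train_videos + eval_videos + validation_videos

# Total training audio in hours: videos * average minutes / 60.
train_hours = train_videos * mean_minutes_per_video / 60

print(f"{total_videos:,} videos")                # 100,000,000 videos
print(f"{train_hours / 1e6:.1f}M training hours")  # 5.4M training hours
```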

8 PAPERS • NO BENCHMARKS YET

EPIC-SOUNDS

EPIC-SOUNDS is a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC-KITCHENS-100. EPIC-SOUNDS includes 78.4k categorised and 39.2k non-categorised segments of audible events and actions, distributed across 44 classes.

7 PAPERS • 2 BENCHMARKS

SSC (Spiking Speech Commands v0.2)

The SSC dataset is a spiking version of the Speech Commands dataset released by Google. SSC was generated using Lauscher, an artificial cochlea model. The dataset consists of utterances recorded from a large number of speakers under controlled conditions. Spikes were generated in 700 input channels, and the dataset contains 35 word categories.

5 PAPERS • 1 BENCHMARK

SONYC-UST-V2

A dataset for urban sound tagging with spatiotemporal information. This dataset is intended for the development and evaluation of machine listening systems for real-world urban noise monitoring. While other datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18,510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and the location of the sensor.

4 PAPERS • NO BENCHMARKS YET

TAU-NIGENS Spatial Sound Events 2021

The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array format (MIC) and a first-order Ambisonics format (FOA). The sound events are spatialized as either stationary sound sources in the room, or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a single direction-of-arrival (DoA) if static, a trajectory of DoAs if moving, and a temporal onset and offset time. The isolated sound event recordings used for t

4 PAPERS • 1 BENCHMARK

aGender

The aGender corpus contains audio recordings of predefined utterances and free speech produced by humans of different age and gender. Each utterance is labeled as one of four age groups: Child, Youth, Adult, Senior, and as one of three gender classes: Female, Male and Child.

4 PAPERS • NO BENCHMARKS YET

DCASE 2014

DCASE2014 is an audio classification benchmark.

3 PAPERS • NO BENCHMARKS YET

HUME-VB (The Hume Vocal Bursts Dataset)

The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train and validation recordings.

3 PAPERS • 7 BENCHMARKS

SINGA:PURA (SINGApore: Polyphonic URban Audio)

This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as a part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but also possess a wider applicability to sound event detection, classification, and localization. The taxonomy we used for the labels in this dataset has been designed to be compatible with other existing datasets for urban sound tagging while also able to capture sound events unique to the Singaporean context. Please refer to our conference paper published in APSIPA 2021 (which is found in this repository as the file "APSIPA.pdf") or download the readme ("Readme.md") for more details regarding the data collection, annotation, and processing methodologies for the creation of the dataset.

2 PAPERS • NO BENCHMARKS YET

VGGSound-Sparse

The dataset uses VGG-Sound which consists of 10s clips collected from YouTube for 309 sound classes. A subset of ‘temporally sparse’ classes is selected using the following procedure: 5–15 videos are randomly picked from each of the 309 VGGSound classes, and manually annotated as to whether audio-visual cues are only sparsely available. As a result, 12 classes are selected (∼4 %) or 6.5k and 0.6k videos in the train and test sets, respectively. The classes include 'dog barking', 'chopping wood', 'lion roaring', 'skateboarding' etc.

2 PAPERS • NO BENCHMARKS YET

BGG dataset (PUBG Gun Sound Dataset)

We recorded gun sounds by changing the type and position of guns to diversify distances and angles in the PUBG environment. The BGG dataset consists of 2,195 samples with 37 different types of guns and five directions, including a silence class in which there is no gunfire but background noise is present. The distance from the firearms ranged from 0 meters to 600 meters. The audio was recorded in stereo (i.e., two-channel audio), and each sample contains various environmental noises (e.g., water splashing, walking, and bullet friction).

1 PAPER • NO BENCHMARKS YET

DEEP-VOICE: DeepFake Voice Recognition (Jordan Bird)

DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. This dataset contains examples of real human speech and DeepFake versions of that speech generated using Retrieval-based Voice Conversion.

1 PAPER • 1 BENCHMARK

InfantMarmosetsVox

InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher. A PyTorch Dataloader is included in this dataset.

1 PAPER • 1 BENCHMARK

Multimodal PISA (Multimodal Piano Skills Assessment)

A dataset for multimodal skill assessment, focused on assessing a piano player's skill level. Annotations include the player's skill level and the song difficulty level. Bounding-box annotations around the pianists' hands are also provided.

1 PAPER • 3 BENCHMARKS

RESPIRATORY AND DRUG ACTUATION DATASET

Asthma is a common, usually long-term respiratory disease with a negative impact on society and the economy worldwide. Treatment involves using medical devices (inhalers) that distribute medication to the airways, and its efficiency depends on the precision of the inhalation technique. Health monitoring systems equipped with sensors and embedded with sound signal detection enable the recognition of drug actuation and could be powerful tools for reliable audio content analysis. The RDA Suite includes a set of tools for audio processing, feature extraction and classification and is provided along with a dataset consisting of respiratory and drug actuation sounds. The classification models in RDA are implemented based on conventional and advanced machine learning and deep network architectures. This study provides a comparative evaluation of the implemented approaches, examines potential improvements and discusses challenges and future tendencies. The central aim of this research is to ident

1 PAPER • NO BENCHMARKS YET

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708

1 PAPER • 1 BENCHMARK

Zooniverse (HumBug Zooniverse)

The HumBug Zooniverse dataset is a dataset of mosquito audio recordings. With over a thousand contributors, it contains 195,434 labels of two-second duration, of which approximately 10 percent signify mosquito events.

1 PAPER • NO BENCHMARKS YET

nEMO

nEMO is a simulated dataset of emotional speech in the Polish language. The corpus contains over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language. The corpus is available for free under the Creative Commons license (CC BY-NC-SA 4.0).

1 PAPER • NO BENCHMARKS YET

Mudestreda (Mudestreda Multimodal Device State Recognition Dataset)

Mudestreda is a multimodal device state recognition dataset obtained from a real industrial milling device, with time series and image data for classification, regression, anomaly detection, Remaining Useful Life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.

0 PAPER • NO BENCHMARKS YET
