Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from YouTube, therefore many of which are in poor-quality and contain multiple sound-sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated as different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. All the videos are split into Evaluation/Balanced-Train/Unbalanced-Train set.
399 PAPERS • 3 BENCHMARKS
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
238 PAPERS • 4 BENCHMARKS
The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2000 5s-clips of 50 different classes across natural, human and domestic sounds, again, drawn from Freesound.org.
185 PAPERS • 6 BENCHMARKS
Urban Sound 8K is an audio dataset that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
99 PAPERS • 2 BENCHMARKS
This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version (EPIC-KITCHENS-55), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection also enables evaluating the "test of time" - i.e. whether models trained on data collected in 2018 can generalise to new footage collected under the same hypotheses albeit "two years on". The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition.
77 PAPERS • 5 BENCHMARKS
Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra. It consists mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more.
68 PAPERS • 2 BENCHMARKS
The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and will focus on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a
23 PAPERS • 2 BENCHMARKS
The DiCOVA Challenge dataset is derived from the Coswara dataset, a crowd-sourced dataset of sound recordings from COVID-19 positive and non-COVID-19 individuals. The Coswara data is collected using a web-application2, launched in April-2020, accessible through the internet by anyone around the globe. The volunteering subjects are advised to record their respiratory sounds in a quiet environment.
18 PAPERS • 1 BENCHMARK
The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set of FSDnoisy18k consists of 15,813 audio clips (38.8h), and the test set consists of 947 audio clips (1.4h) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label is part of the target class set. Analogously, OOV means that the true or missing label is not covered by those 20 classes.
14 PAPERS • NO BENCHMARKS YET
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
14 PAPERS • 6 BENCHMARKS
The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics - ICBHI 2017.
8 PAPERS • 1 BENCHMARK
The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and 20M validation videos. Videos average 4.6 minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more topic identifiers from a set of 30,871 labels. There are an average of around 5 labels per video. The labels are assigned automatically based on a combination of metadata (title, description, comments, etc.), context, and image content for each video. The labels apply to the entire video and range from very generic (e.g. “Song”) to very specific (e.g. “Cormorant”). Being machine generated, the labels are not 100% accurate and of the 30K labels, some are clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are often labeled “Entertainment” as well, although no hierarchy is enforced.
7 PAPERS • NO BENCHMARKS YET
VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. The VocalSound dataset also contains meta-information such as speaker age, gender, native language, country, and health condition.
5 PAPERS • 1 BENCHMARK
A dataset for urban sound tagging with spatiotemporal information. This dataset is aimed for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor.
3 PAPERS • NO BENCHMARKS YET
The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats, a microphone array one (MIC), and first-order Ambisonics one (FOA). The sound events are spatialized as either stationary sound sources in the room, or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a single direction-of-arrival (DoA) if static, a trajectory DoAs if moving, and a temporal onset and offset time. The isolated sound event recordings used for t
3 PAPERS • 1 BENCHMARK
The aGender corpus contains audio recordings of predefined utterances and free speech produced by humans of different age and gender. Each utterance is labeled as one of four age groups: Child, Youth, Adult, Senior, and as one of three gender classes: Female, Male and Child.
3 PAPERS • NO BENCHMARKS YET
This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as a part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but also possess a wider applicability to sound event detection, classification, and localization. The taxonomy we used for the labels in this dataset has been designed to be compatible with other existing datasets for urban sound tagging while also able to capture sound events unique to the Singaporean context. Please refer to our conference paper published in APSIPA 2021 (which is found in this repository as the file "APSIPA.pdf") or download the readme ("Readme.md") for more details regarding the data collection, annotation, and processing methodologies for the creation of the dataset.
2 PAPERS • NO BENCHMARKS YET
We recorded gun sounds by changing the type and position of guns to diversify distances and angles in the PUBG environment. The BGG dataset consists of 2,195 samples with 37 different types of guns and five directions, including a silence in which there is no gunfire, but noises exist. The distance from the firearms ranged from 0 meters to 600 meters. The audio was recorded in stereo (i.e., two-channel audio), and each sample contains various environmental noises (e.g., water splashing, walking, and bullet friction).
1 PAPER • NO BENCHMARKS YET
The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train and validation recordings.
1 PAPER • NO BENCHMARKS YET
Dataset for multimodal skills assessment focusing on assessing piano player’s skill level. Annotations include player's skills level, and song difficulty level. Bounding box annotations around pianists' hands are also provided.
1 PAPER • 3 BENCHMARKS
Asthma is a common, usually long-term respiratory disease with negative impact on society and the economy worldwide. Treatment involves using medical devices (inhalers) that distribute medicationto the airways, and its efficiency depends on the precision of the inhalation technique. Health monitoring systems equipped with sensors and embedded with sound signal detection enable the recognition of drug actuation and could be powerful tools for reliable audio content analysis. The RDA Suite includes a set of tools for audio processing, feature extraction and classification and is provided along with a dataset consisting of respiratory and drug actuation sounds. The classification models in RDA are implemented based on conventional and advanced machine learning and deep network architectures. This study provides a comparative evaluation of the implemented approaches, examines potential improvements and discusses challenges and future tendencies. The central aim of this research is to ident
1 PAPER • NO BENCHMARKS YET
The dataset uses VGG-Sound which consists of 10s clips collected from YouTube for 309 sound classes. A subset of ‘temporally sparse’ classes is selected using the following procedure: 5–15 videos are randomly picked from each of the 309 VGGSound classes, and manually annotated as to whether audio-visual cues are only sparsely available. As a result, 12 classes are selected (∼4 %) or 6.5k and 0.6k videos in the train and test sets, respectively. The classes include 'dog barking', 'chopping wood', 'lion roaring', 'skateboarding' etc.
1 PAPER • NO BENCHMARKS YET