DCASE 2016 is a dataset for sound event detection. It consists of 20 short mono sound files for each of 11 sound classes from office environments (e.g., clearthroat, drawer, or keyboard), each file containing one sound event instance. Sound files are annotated with event onset and offset times; however, silences between the physical sounds of a single event (e.g., between the rings of a phone) are not marked separately and are therefore included in the annotated event.
30 PAPERS • NO BENCHMARKS YET
The MTG-Jamendo dataset is an open dataset for music auto-tagging. It contains over 55,000 full audio tracks annotated with 195 tags (95 genre tags, 41 instrument tags, and 59 mood/theme tags). It is built from music available on Jamendo under Creative Commons licenses, with tags provided by the content uploaders. All audio is distributed in 320 kbps MP3 format.
15 PAPERS • NO BENCHMARKS YET
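MTG-Jamendo annotations encode each tag's category in the tag string itself. The sketch below groups such tags by category; the `category---value` form (e.g. `genre---rock`) follows the dataset's metadata convention, but the sample tag list is illustrative, not taken from real metadata.

```python
# Sketch: grouping MTG-Jamendo-style "category---value" tag strings.
# The sample tags below are hypothetical, for illustration only.
from collections import defaultdict

def parse_tags(tags):
    """Group tags like 'genre---rock' into {'genre': ['rock'], ...}."""
    grouped = defaultdict(list)
    for tag in tags:
        category, _, value = tag.partition("---")
        grouped[category].append(value)
    return dict(grouped)

tags = ["genre---rock", "genre---pop", "instrument---guitar",
        "mood/theme---energetic"]
print(parse_tags(tags))
# {'genre': ['rock', 'pop'], 'instrument': ['guitar'], 'mood/theme': ['energetic']}
```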
The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with a medium level of traffic, travelling by tram, travelling by bus, travelling by underground metro, and urban park. Each acoustic scene has 1440 segments (240 minutes of audio), for a total of 40 hours of audio.
11 PAPERS • 2 BENCHMARKS
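The stated sizes of TAU Urban Acoustic Scenes 2019 are internally consistent, which a quick calculation confirms:

```python
# Sanity-check the stated sizes of TAU Urban Acoustic Scenes 2019:
# 10 scenes x 1440 segments x 10 s per segment should give 40 hours total.
scenes = 10
segments_per_scene = 1440
segment_seconds = 10

minutes_per_scene = segments_per_scene * segment_seconds / 60
total_hours = scenes * minutes_per_scene / 60

print(minutes_per_scene)  # 240.0 minutes of audio per scene
print(total_hours)        # 40.0 hours in total
```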
DCASE 2013 is a dataset for sound event detection. It consists of audio-only recordings where individual sound events are prominent in an acoustic scene.
10 PAPERS • NO BENCHMARKS YET
The TUT Acoustic Scenes 2017 dataset is a collection of recordings from various acoustic scenes, each from a distinct location. For each recording location, a 3-5 minute audio recording was captured and split into 10-second segments, which serve as the unit samples for this task. All audio clips were recorded at a 44.1 kHz sampling rate with 24-bit resolution.
10 PAPERS • NO BENCHMARKS YET
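The recording-to-segment split described above can be sketched as follows; this is a minimal illustration of cutting a waveform into fixed 10-second units at the dataset's 44.1 kHz sampling rate, using a placeholder list rather than real audio.

```python
# Sketch: splitting a long recording into fixed 10-second analysis segments,
# as described for TUT Acoustic Scenes 2017 (44.1 kHz sampling rate).
# The waveform here is a placeholder list of zeros, not real audio.
SAMPLE_RATE = 44100          # Hz, as specified for the dataset
SEGMENT_SECONDS = 10
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS

def split_into_segments(waveform):
    """Yield full 10-second segments; a trailing partial segment is dropped."""
    for start in range(0, len(waveform) - SEGMENT_SAMPLES + 1, SEGMENT_SAMPLES):
        yield waveform[start:start + SEGMENT_SAMPLES]

# A 3.5-minute recording (within the 3-5 minute range) yields 21 segments.
recording = [0.0] * (SAMPLE_RATE * 210)
segments = list(split_into_segments(recording))
print(len(segments))  # 21
```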
The TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-second audio segments from 10 acoustic scenes, captured with several different mobile recording devices.
5 PAPERS • 1 BENCHMARK
The LITIS-Rouen dataset is a dataset for audio scenes. It consists of 3026 examples of 19 scene categories. Each class is specific to a location, such as a train station or an open market. The audio recordings have a duration of 30 seconds and a sampling rate of 22050 Hz. The dataset has a total duration of approximately 1500 minutes.
3 PAPERS • NO BENCHMARKS YET
The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. The dataset was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5-6 minutes of audio. The original recordings were split into 10-second segments that are provided as individual files. Available information about each recording includes the acoustic scene class, the city, and the recording location.
2 PAPERS • 1 BENCHMARK
CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 acoustic scenes.
1 PAPER • NO BENCHMARKS YET
The TUT Sound Events 2018 dataset consists of real-life first-order Ambisonics (FOA) recordings with stationary point sources, each associated with a spatial coordinate. The dataset was generated by collecting impulse responses (IRs) from a real environment using the Eigenmike spherical microphone array. The measurement was done by slowly moving a Genelec G Two loudspeaker, continuously playing a maximum-length sequence, around the array in a circular trajectory at one elevation at a time. The playback volume was set 30 dB above the ambient sound level. The recording was done during work hours in a university corridor surrounded by classrooms. The IRs were collected at elevations from −40° to 40° in 10-degree increments at 1 m from the Eigenmike, and at elevations from −20° to 20° in 10-degree increments at 2 m.
0 PAPERS • NO BENCHMARKS YET
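The IR measurement grid described above is small enough to enumerate directly; the sketch below lists every (elevation, distance) combination implied by the two elevation ranges and 10-degree steps.

```python
# Sketch: enumerating the impulse-response measurement grid described above:
# elevations -40..40 in 10-degree steps at 1 m, and -20..20 at 2 m.
grid = ([(elev, 1) for elev in range(-40, 41, 10)] +
        [(elev, 2) for elev in range(-20, 21, 10)])

print(len(grid))          # 14 measured (elevation, distance) combinations
print(grid[0], grid[-1])  # (-40, 1) (20, 2)
```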