SimSceneTVB Perception is a corpus of 100 sound scenes of 45 s each representing urban sound environments: 6 scenes recorded in Paris, 19 scenes simulated using simScene to replicate the recorded scenarios, and 75 scenes simulated using simScene with diverse new scenarios. The scenes contain traffic, human voice, and bird sources. The base audio files used for simulation were obtained from Freesound (https://freesound.org) and LibriSpeech (http://www.openslr.org/12).
The Sound Events for Surveillance Applications (SESA) dataset consists of audio files obtained from Freesound, divided into train (480 files) and test (105 files) folders. All audio files are mono, 16 kHz, 8-bit WAV recordings of up to 33 seconds. Classes: 0 = Casual (not a threat), 1 = Gunshot, 2 = Explosion, 3 = Siren (also contains alarms).
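As a minimal sketch of how such a clip could be inspected in Python using only the standard library (the helper name and label mapping shape here are illustrative, not part of the dataset):

```python
import wave

# Class IDs as listed in the dataset description above.
SESA_CLASSES = {0: "Casual (not a threat)", 1: "Gunshot", 2: "Explosion", 3: "Siren"}

def load_clip(path):
    """Read a mono, 16 kHz, 8-bit SESA WAV file; return raw bytes and duration."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "SESA files are mono"
        assert w.getframerate() == 16000, "SESA files are sampled at 16 kHz"
        frames = w.readframes(w.getnframes())         # unsigned 8-bit PCM bytes
        duration = w.getnframes() / w.getframerate()  # up to ~33 s
    return frames, duration
```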
The TAU Spatial Sound Events 2019 - Ambisonic dataset contains synthetic sound scene recordings (along with its Microphone Array sister dataset). It provides four-channel First-Order Ambisonic (FOA) recordings. The recordings consist of stationary point sources from multiple sound classes, each associated with a temporal onset and offset time and a direction of arrival (DOA) given as azimuth and elevation angles. The development set consists of 400 one-minute recordings sampled at 48 kHz, divided into four cross-validation splits of 100 recordings each. These recordings were synthesized using spatial room impulse responses (IRs) collected at five indoor locations, at 504 unique azimuth-elevation-distance combinations. The collected IRs were convolved with isolated sound events from the DCASE 2016 Task 2 dataset, and, to create realistic sound scene recordings, natural ambient noise collected at the IR recording locations was added to the synthesized mixtures.
The TAU Spatial Sound Events 2019 - Microphone Array dataset contains synthetic sound scene recordings (along with its Ambisonic sister dataset). It provides four-channel directional microphone recordings from a tetrahedral array configuration. The recordings consist of stationary point sources from multiple sound classes, each associated with a temporal onset and offset time and a direction of arrival (DOA) given as azimuth and elevation angles. The development set consists of 400 one-minute recordings sampled at 48 kHz, divided into four cross-validation splits of 100 recordings each. These recordings were synthesized using spatial room impulse responses (IRs) collected at five indoor locations, at 504 unique azimuth-elevation-distance combinations. The collected IRs were convolved with isolated sound events from the DCASE 2016 Task 2 dataset, and, to create realistic sound scene recordings, natural ambient noise collected at the IR recording locations was added to the synthesized mixtures.
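Both sister datasets share the same event annotations (onset/offset times plus azimuth/elevation per event). Below is a hedged sketch of reading such per-recording metadata; the CSV column names (`start_time`, `end_time`, `azi`, `ele`) are assumptions for illustration and should be checked against the dataset's documentation.

```python
import csv

def read_events(csv_path):
    """Parse one recording's event list: onset/offset in seconds, DOA in degrees."""
    events = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            events.append({
                "onset": float(row["start_time"]),
                "offset": float(row["end_time"]),
                "azimuth": float(row["azi"]),    # degrees
                "elevation": float(row["ele"]),  # degrees
            })
    return events
```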
The TUT Rare Sound Events 2017 development dataset consists of source files for creating mixtures of rare sound events (classes: baby cry, gun shot, glass break) with background audio, as well as a set of readily generated mixtures and the recipes used to generate them. The "source" part of the dataset consists of three subsets: (a) background recordings from 15 different acoustic scenes; (b) recordings of the target rare sound events from the three classes, accompanied by annotations of their temporal occurrences; and (c) a set of meta files providing the cross-validation setup: lists of background and target event recordings split into training and test subsets (called "devtrain" and "devtest", respectively, indicating that they are provided as the development dataset, as opposed to the separately released evaluation dataset). The mixture set consists of two subsets (training and testing), each containing ~1500 mixtures (~500 per target class in each subset, with half of the mixtures not containing any target event).
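A mixture recipe essentially places a target event into a background at a controlled level. The following is a simplified sketch of that idea, not the dataset's official synthesis code; the event-to-background ratio (EBR) handling and the function name are assumptions.

```python
import numpy as np

def mix_event(background, event, ebr_db, onset_sample):
    """Scale `event` to the requested EBR (in dB) and add it into `background`."""
    bg_rms = np.sqrt(np.mean(background ** 2))
    ev_rms = np.sqrt(np.mean(event ** 2)) + 1e-12
    gain = (bg_rms / ev_rms) * 10 ** (ebr_db / 20.0)
    mixture = background.copy()
    end = min(onset_sample + len(event), len(mixture))  # clamp at the clip end
    mixture[onset_sample:end] += gain * event[: end - onset_sample]
    return mixture
```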
The TUT Sound Events 2018 dataset consists of real-life first-order Ambisonic (FOA) format recordings with stationary point sources, each associated with a spatial coordinate. The dataset was generated by collecting impulse responses (IRs) from a real environment using the Eigenmike spherical microphone array. The measurement was done by slowly moving a Genelec G Two loudspeaker, continuously playing a maximum length sequence, around the array in a circular trajectory at one elevation at a time. The playback volume was set 30 dB above the ambient sound level. The recording was done in a corridor inside the university, surrounded by classrooms, during work hours. The IRs were collected at elevations from −40° to +40° in 10° increments at 1 m from the Eigenmike, and at elevations from −20° to +20° in 10° increments at 2 m.
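For orientation, the measurement grid described above can be enumerated directly. This sketch lists only the elevation/distance rings, since azimuth was covered continuously along each circular trajectory.

```python
# Elevation/distance rings of the IR measurement described above:
# -40..+40 degrees in 10-degree steps at 1 m, -20..+20 degrees at 2 m.
rings = [(elev, 1.0) for elev in range(-40, 41, 10)] \
      + [(elev, 2.0) for elev in range(-20, 21, 10)]
print(len(rings), "elevation/distance rings")  # 14
```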
The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/).
The WASABI Song Corpus is a large corpus of songs enriched with metadata extracted from music databases on the Web and resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an important part of the semantics of a song, the authors focus on describing the methods they propose to extract relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song, and the emotions conveyed. The corpus contains 1.73M songs with lyrics (1.41M unique lyrics), annotated at different levels with the output of the above-mentioned methods. These labels and the provided methods can be exploited by music search engines and music professionals (e.g., journalists, radio presenters) to better handle large collections of lyrics, enabling intelligent browsing, categorization, and segmentation-based recommendation of songs.
Yesno is an audio dataset consisting of 60 recordings of one individual saying yes or no in Hebrew; each recording is eight words long. It was created for the Kaldi audio project by an author who wishes to remain anonymous.
Freefield1010 is a collection of 7,690 excerpts from field recordings around the world, gathered by the FreeSound project, and then standardised for research.
warblrb10k is a collection of 10,000 smartphone audio recordings from around the UK, crowdsourced by users of Warblr, a bird-recognition app. The audio covers a wide distribution of UK locations and environments, and includes weather noise, traffic noise, human speech, and even human imitations of birds.