Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
309 PAPERS • 4 BENCHMARKS
The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with medium level of traffic, travelling by a tram, travelling by a bus, travelling by an underground metro, and urban park. Each acoustic scene has 1,440 segments (240 minutes of audio), for a total of 40 hours of audio.
11 PAPERS • 2 BENCHMARKS
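The duration figures quoted for TAU Urban Acoustic Scenes 2019 can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the constant and function names are illustrative and not part of any dataset tooling, and the numbers are taken directly from the description above.

```python
# Figures quoted in the TAU Urban Acoustic Scenes 2019 description.
# Names below are illustrative, not part of any official dataset API.
NUM_SCENES = 10            # acoustic scene classes
SEGMENTS_PER_SCENE = 1440  # segments per scene
SEGMENT_SECONDS = 10       # length of each segment in seconds


def scene_minutes(segments=SEGMENTS_PER_SCENE, seconds=SEGMENT_SECONDS):
    """Audio per scene, in minutes."""
    return segments * seconds / 60


def total_hours(scenes=NUM_SCENES, segments=SEGMENTS_PER_SCENE,
                seconds=SEGMENT_SECONDS):
    """Audio across all scenes, in hours."""
    return scenes * segments * seconds / 3600


if __name__ == "__main__":
    print(scene_minutes())  # 240.0 minutes per scene
    print(total_hours())    # 40.0 hours in total
```

Both results agree with the 240 minutes per scene and 40 hours in total stated in the description.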
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
8 PAPERS • 7 BENCHMARKS
The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and transcripts generated automatically by a speech-to-text system. The podcast audio recordings, sourced from SoundCloud, are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events, including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips.
3 PAPERS • 1 BENCHMARK
The Football Keyword Dataset (FKD) is a new Persian keyword spotting dataset collected through crowdsourcing. It contains nearly 31,000 samples in 18 classes.
2 PAPERS • 2 BENCHMARKS
This noisy speech test set is created from the Google Speech Commands v2 dataset and the MUSAN dataset.
2 PAPERS • 1 BENCHMARK
Auto-KWS is a dataset for customized keyword spotting, the task of detecting spoken keywords. The dataset closely resembles real-world scenarios, as each recorder is assigned a unique wake-up word and can freely choose their recording environment and a dialect they are familiar with.
1 PAPER • NO BENCHMARKS YET