WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus.
144 PAPERS • 2 BENCHMARKS
LibriMix is an open-source alternative to wsj0-2mix. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!.
80 PAPERS • 1 BENCHMARK
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. It has an extension called WHAMR! that adds artificial reverberation to the speech signals in addition to the background noise.
80 PAPERS • 5 BENCHMARKS
Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
64 PAPERS • 2 BENCHMARKS
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech sources in addition to the existing noise. Room impulse responses were generated and convolved using pyroomacoustics. Reverberation times were chosen to approximate domestic and classroom environments (expected to be similar to the restaurants and coffee shops where the WHAM! noise was collected), and further classified as high, medium, and low reverberation based on a qualitative assessment of the mixture’s noise recording.
45 PAPERS • 3 BENCHMARKS
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English each reading 10 phonetically-rich sentences. It also comes with the word and phone-level transcriptions of the speech.
27 PAPERS • 5 BENCHMARKS
Real-M is a crowd-sourced speech-separation corpus of real-life mixtures. The mixtures are recorded in different acoustic environments using a wide variety of recording devices such as laptops and smartphones, thus reflecting more closely potential application scenarios.
3 PAPERS • NO BENCHMARKS YET
Here we release the dataset (Multi_Channel_Grid, abbreviated as MC_Grid) used in our paper LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION.
1 PAPER • NO BENCHMARKS YET
WHAMR_ext is an extension to the WHAMR corpus with larger RT60 values (between 1s and 3s)
1 PAPER • 1 BENCHMARK