The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are a part of the LibriVox project. Most of the audiobooks come from the Project Gutenberg. The training data is split into 3 partitions of 100hr, 360hr, and 500hr sets while the dev and test data are split into the ’clean’ and ’other’ categories, respectively, depending upon how well or challenging Automatic Speech Recognition systems would perform against. Each of the dev and test sets is around 5hr in audio length. This corpus also provides the n-gram language models and the corresponding texts excerpted from the Project Gutenberg books, which contain 803M tokens and 977K unique words.
1,378 PAPERS • 10 BENCHMARKS
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:
56 PAPERS • 5 BENCHMARKS
The DEMAND (Diverse Environments Multichannel Acoustic Noise Database) provides a set of recordings that allow testing of algorithms using real-world noise in a variety of settings. This version provides 15 recordings. All recordings are made with a 16-channel array, with the smallest distance between microphones being 5 cm and the largest being 21.8 cm.
54 PAPERS • 1 BENCHMARK
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. It has an extension called WHAMR! that adds artificial reverberation to the speech signals in addition to the background noise.
52 PAPERS • 5 BENCHMARKS
LibriMix is an open-source alternative to wsj0-2mix. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!.
43 PAPERS • 1 BENCHMARK
The REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge is a benchmark for evaluation of automatic speech recognition techniques. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channe, 2-channel or 8-channel microphone-arrays in reverberant meeting rooms. It features both real recordings and simulated data.
40 PAPERS • 1 BENCHMARK
The DNS Challenge at INTERSPEECH 2020 intended to promote collaborative research in single-channel Speech Enhancement aimed to maximize the perceptual quality and intelligibility of the enhanced speech. The challenge evaluated the speech quality using the online subjective evaluation framework ITU-T P.808. The challenge provides large datasets for training noise suppressors.
34 PAPERS • 3 BENCHMARKS
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech sources in addition to the existing noise. Room impulse responses were generated and convolved using pyroomacoustics. Reverberation times were chosen to approximate domestic and classroom environments (expected to be similar to the restaurants and coffee shops where the WHAM! noise was collected), and further classified as high, medium, and low reverberation based on a qualitative assessment of the mixture’s noise recording.
29 PAPERS • 3 BENCHMARKS
The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing , and machine learning.
28 PAPERS • NO BENCHMARKS YET
Contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio.
17 PAPERS • 1 BENCHMARK
The QMUL underGround Re-IDentification (GRID) dataset contains 250 pedestrian image pairs. Each pair contains two images of the same individual seen from different camera views. All images are captured from 8 disjoint camera views installed in a busy underground station. The figures beside show a snapshot of each of the camera views of the station and sample images in the dataset. The dataset is challenging due to variations of pose, colours, lighting changes; as well as poor image quality caused by low spatial resolution.
8 PAPERS • 5 BENCHMARKS
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR) -motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
7 PAPERS • 4 BENCHMARKS
L3DAS22: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING This dataset supports the L3DAS22 IEEE ICASSP Gand Challenge. The challenge is supported by a Python API that facilitates the dataset download and preprocessing, the training and evaluation of the baseline models and the results submission.
7 PAPERS • NO BENCHMARKS YET
L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65 hours 3D audio corpus, accompanied with a Python API that facilitates the data usage and results submission stage.
4 PAPERS • 2 BENCHMARKS
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, it contains more than 97,000 human ratings for each of the dimensions and the overall MOS.
1 PAPER • NO BENCHMARKS YET