The LibriSpeech corpus is a collection of approximately 1,000 hours of read audiobook speech drawn from the LibriVox project; most of the underlying texts come from Project Gutenberg. The training data is split into three partitions of 100, 360, and 500 hours, while the dev and test data are each split into 'clean' and 'other' categories, depending on how well or how poorly automatic speech recognition systems are expected to perform on them. Each of the dev and test sets is around 5 hours of audio. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.
1,337 PAPERS • 10 BENCHMARKS
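As a quick way to work with the partitions described above, torchaudio ships a loader for LibriSpeech; a minimal sketch follows (the root directory is an arbitrary local choice, not part of the corpus):

```python
# Minimal sketch: load the 100 hr clean training partition of LibriSpeech
# with torchaudio's built-in dataset class. The other partitions use the
# URLs "train-clean-360", "train-other-500", "dev-clean", "dev-other",
# "test-clean", and "test-other".
import os
import torchaudio

os.makedirs("./data", exist_ok=True)  # arbitrary download location
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, transcript)  # 16 kHz audio with its reference text
```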
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
234 PAPERS • 4 BENCHMARKS
MUSAN is a corpus of music, speech and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises.
126 PAPERS • NO BENCHMARKS YET
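Since MUSAN is typically used as an augmentation source, here is a hedged sketch of the usual recipe: scale a noise clip so that it sits at a chosen SNR relative to a clean utterance. The file paths are placeholders, not a claim about the MUSAN directory layout.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)  # loop or trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Placeholder paths; substitute real speech and MUSAN noise files.
speech, sr = sf.read("utterance.wav")
noise, _ = sf.read("musan_noise_clip.wav")
sf.write("noisy.wav", mix_at_snr(speech, noise, snr_db=10.0), sr)
```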
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
122 PAPERS • 1 BENCHMARK
WSJ0-2mix is a speech separation corpus of two-speaker mixtures created from utterances in the Wall Street Journal (WSJ0) corpus.
114 PAPERS • 2 BENCHMARKS
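Separation systems benchmarked on wsj0-2mix are commonly scored with scale-invariant SNR (SI-SNR); the sketch below implements the standard definition and is not an official scorer.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB."""
    estimate = estimate - estimate.mean()  # zero-mean for scale invariance
    target = target - target.mean()
    # Project the estimate onto the target: the "clean" component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```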
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus. The main differences from the LibriSpeech corpus are: (1) the audio files are at a 24kHz sampling rate; (2) the speech is split at sentence breaks; (3) both original and normalized texts are included; (4) contextual information (e.g., neighbouring sentences) can be extracted; and (5) utterances with significant background noise are excluded.
91 PAPERS • NO BENCHMARKS YET
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. It has an extension called WHAMR! that adds artificial reverberation to the speech signals in addition to the background noise.
50 PAPERS • 5 BENCHMARKS
LibriMix is an open-source alternative to wsj0-2mix. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!.
43 PAPERS • 1 BENCHMARK
Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
42 PAPERS • NO BENCHMARKS YET
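To make the construction concrete, here is an illustrative sketch of the overlapping-concatenation idea; LibriCSS itself was replayed through loudspeakers and re-recorded, and the overlap ratio below is a made-up parameter, not the dataset's actual setting.

```python
import numpy as np

def concat_with_overlap(utts: list[np.ndarray], overlap_ratio: float = 0.2) -> np.ndarray:
    """Chain utterances so each one overlaps the tail of the running mixture."""
    out = utts[0].astype(float).copy()
    for utt in utts[1:]:
        overlap = int(len(utt) * overlap_ratio)
        start = max(len(out) - overlap, 0)
        mixed = np.zeros(max(start + len(utt), len(out)))
        mixed[:len(out)] += out      # existing mixture
        mixed[start:start + len(utt)] += utt  # next utterance, overlapping the tail
        out = mixed
    return out
```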
The REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge is a benchmark for the evaluation of automatic speech recognition techniques. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channel, 2-channel, or 8-channel microphone arrays in reverberant meeting rooms. It features both real recordings and simulated data.
40 PAPERS • 1 BENCHMARK
CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. It contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different genres in the real world.
38 PAPERS • 1 BENCHMARK
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012.
38 PAPERS • NO BENCHMARKS YET
Multimodal Opinion-level Sentiment Intensity (MOSI) contains: (1) multimodal observations, including transcribed speech and visual gestures as well as automatic audio and visual features; (2) opinion-level subjectivity segmentation; (3) sentiment intensity annotations with high coder agreement; and (4) alignment between words, visual, and acoustic features.
38 PAPERS • 1 BENCHMARK
AISHELL-2 contains 1000 hours of clean read-speech data recorded on iOS devices and is free for academic usage.
36 PAPERS • NO BENCHMARKS YET
VoxPopuli is a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages, totaling 5.1K hours.
35 PAPERS • 3 BENCHMARKS
The VOiCES corpus is a dataset to promote speech and signal processing research on speech recorded by far-field microphones in noisy room conditions.
34 PAPERS • NO BENCHMARKS YET
2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English telephone conversations used in the 2000 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).
31 PAPERS • 1 BENCHMARK
The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning.
28 PAPERS • NO BENCHMARKS YET
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech sources in addition to the existing noise. Room impulse responses were generated and convolved using pyroomacoustics. Reverberation times were chosen to approximate domestic and classroom environments (expected to be similar to the restaurants and coffee shops where the WHAM! noise was collected), and further classified as high, medium, and low reverberation based on a qualitative assessment of the mixture’s noise recording.
28 PAPERS • 3 BENCHMARKS
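In the spirit of that pipeline, here is a minimal pyroomacoustics sketch that simulates a shoebox room and convolves a source with its room impulse response; the room geometry and absorption values are illustrative, not the parameters used to build WHAMR!.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(fs)  # stand-in for a one-second WSJ0 utterance

room = pra.ShoeBox(
    [6.0, 5.0, 3.0],              # room dimensions in metres (illustrative)
    fs=fs,
    materials=pra.Material(0.35),  # wall energy absorption (illustrative)
    max_order=17,                  # image-source reflection order
)
room.add_source([2.0, 3.1, 1.5], signal=speech)
room.add_microphone([4.0, 2.0, 1.2])
room.simulate()  # computes the RIRs and convolves the source with them

reverberant = room.mic_array.signals[0]
```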
The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, recording environment, ambient noise, number of speakers, and speaker demographics. The development set includes reference diarization and speech segmentation and may be used for any purpose including system development or training.
26 PAPERS • 1 BENCHMARK
THCHS-30 is a free Chinese speech database that can be used to build a full-fledged Chinese speech recognition system.
26 PAPERS • NO BENCHMARKS YET
CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest version, CoVoST 2, covers translations from 21 languages into English and from English into 15 languages. It has a total of 2,880 hours of speech and is diversified with 78K speakers and 66 accents.
22 PAPERS • NO BENCHMARKS YET
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.
22 PAPERS • 3 BENCHMARKS
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for the evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences. It also comes with word- and phone-level transcriptions of the speech.
22 PAPERS • 1 BENCHMARK
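TIMIT's phone-level transcriptions are plain-text .PHN files with one "start_sample end_sample phone" triple per line; a small parser sketch follows (the example path assumes the standard corpus layout):

```python
from pathlib import Path

def read_phn(path: str) -> list[tuple[int, int, str]]:
    """Parse a TIMIT .PHN file into (start_sample, end_sample, phone) triples."""
    segments = []
    for line in Path(path).read_text().splitlines():
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    return segments

# e.g. read_phn("TRAIN/DR1/FCJF0/SA1.PHN")
# returns something like [(0, ..., "h#"), ...], where "h#" marks silence.
```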
AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker text-to-speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, totaling 88,035 utterances. Auxiliary attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus, and both character-level and pinyin-level transcripts are provided along with the recordings. The word and tone transcription accuracy rate is above 98%, achieved through professional speech annotation and strict quality inspection for tone and prosody.
21 PAPERS • NO BENCHMARKS YET
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
19 PAPERS • NO BENCHMARKS YET
ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
16 PAPERS • NO BENCHMARKS YET
The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1991 under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-1993.
16 PAPERS • 2 BENCHMARKS
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from its corresponding video captions.
14 PAPERS • 1 BENCHMARK
MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) is an extension of the CMU Wilderness Multilingual Speech Dataset, a speech dataset based on recorded readings of the New Testament.
13 PAPERS • 3 BENCHMARKS
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
13 PAPERS • 6 BENCHMARKS
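RAVDESS encodes all of these conditions in the filename as seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor); a decoding sketch based on the published naming convention:

```python
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess(filename: str) -> dict:
    """Decode a RAVDESS filename such as '03-01-06-01-02-01-12.wav'."""
    modality, channel, emotion, intensity, statement, repetition, actor = (
        filename.removesuffix(".wav").split("-")
    )
    return {
        "modality": {"01": "full-AV", "02": "video-only", "03": "audio-only"}[modality],
        "vocal_channel": {"01": "speech", "02": "song"}[channel],
        "emotion": EMOTIONS[emotion],
        "intensity": {"01": "normal", "02": "strong"}[intensity],
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),  # odd actor numbers are male, even are female
    }

# parse_ravdess("03-01-06-01-02-01-12.wav")
# -> audio-only speech, fearful, normal intensity, actor 12 (female)
```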
In SpokenSQuAD, the document is in spoken form, the input question is in the form of text, and the answer to each question is always a span in the document. The following procedure was used to generate spoken documents from the original SQuAD dataset. First, the Google text-to-speech system was used to generate the spoken version of the articles in SQuAD. Then CMU Sphinx was used to generate the corresponding ASR transcriptions. The SQuAD training set was used to generate the training set of Spoken SQuAD, and the SQuAD development set was used to generate the testing set. If the answer to a question did not exist in the ASR transcriptions of the associated article, the question-answer pair was removed from the dataset, because such examples are too difficult for listening comprehension machines at this stage.
13 PAPERS • 1 BENCHMARK
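The filtering step described above reduces to a substring test against the ASR transcript; a hedged sketch, where the "answer" field name is an assumption rather than the dataset's exact schema:

```python
def filter_pairs(qa_pairs: list[dict], asr_transcript: str) -> list[dict]:
    """Keep only QA pairs whose answer string appears in the ASR transcript."""
    transcript = asr_transcript.lower()
    return [pair for pair in qa_pairs if pair["answer"].lower() in transcript]
```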
Stuttering Events in Podcasts (SEP-28k) is a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter.
10 PAPERS • NO BENCHMARKS YET
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test). Count of microphones: 3 (Microsoft Kinect, Yamaha, Samson). Count of wave files per microphone: about 14,500. Overall count of participants: 180 (130 male / 50 female).
10 PAPERS • 1 BENCHMARK
SPEECH-COCO contains speech captions that are generated using text-to-speech (TTS) synthesis resulting in 616,767 spoken captions (more than 600h) paired with images.
8 PAPERS • NO BENCHMARKS YET
BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches, comprising three parts: the audio files, the transcripts, and the translations. The corpus can be used to build automatic simultaneous interpretation systems. It is collected from Mandarin Chinese talks and reports covering science, technology, culture, economy, etc. The utterances in the talks and reports are carefully transcribed into Chinese text and further translated into English text. The sentence boundaries are determined by the English text rather than the Chinese text, analogous to previous related corpora (TED and the Translation Augmented LibriSpeech Corpus).
7 PAPERS • NO BENCHMARKS YET
CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.
7 PAPERS • NO BENCHMARKS YET
GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include multiple layers such as part-of-speech tags, document structure, dependency syntax, entity and coreference annotation, and Rhetorical Structure Theory discourse analyses.
7 PAPERS • 1 BENCHMARK
JVS is a Japanese multi-speaker voice corpus which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data including 22 hours of parallel normal voices.
7 PAPERS • NO BENCHMARKS YET
Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. The first phase of the SLUE benchmark suite consists of named entity recognition (NER), sentiment analysis (SA), and ASR on the corresponding datasets.
7 PAPERS • 3 BENCHMARKS
VOCASET is a 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio. The dataset has 12 subjects and 480 sequences of about 3-4 seconds each with sentences chosen from an array of standard protocols that maximize phonetic diversity.
7 PAPERS • NO BENCHMARKS YET
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR)-motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source poses, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes, and source identification labels. The dataset was created and released to facilitate research in multi-modal AR solutions to the cocktail party problem.
6 PAPERS • 3 BENCHMARKS
The MRDA corpus consists of about 75 hours of speech from 75 naturally-occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. It is annotated with three types of information: marking of the dialogue act segment boundaries, marking of the dialogue acts and marking of correspondences between dialogue acts.
6 PAPERS • 1 BENCHMARK
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with free and open source speech recognition engines (on Linux, Windows, and Mac).
6 PAPERS • 7 BENCHMARKS
Libri-adhoc40 is a synchronized speech corpus that collects replayed LibriSpeech data, played from a loudspeaker and recorded by an ad-hoc microphone array of 40 strongly synchronized distributed nodes in a real office environment. In addition, to provide an evaluation target for speech front-end processing and other applications, the authors also recorded the replayed speech in an anechoic chamber.
5 PAPERS • NO BENCHMARKS YET
The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones at a sampling rate of 16 kHz. The dialogs are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcriptions and precise speaker voice activity timestamps are manually labeled for each sample, and detailed speaker information is also provided.
5 PAPERS • NO BENCHMARKS YET
SPGISpeech (pronounced "speegie-speech") is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours of professionally transcribed financial audio. In contrast to previous transcription datasets, SPGISpeech contains global English accents, strongly varying audio quality, and both spontaneous and presentation-style speech. The transcripts have each been cross-checked by multiple professional editors for high accuracy and are fully formatted, including sentence structure and capitalization.
5 PAPERS • 1 BENCHMARK