The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are part of the LibriVox project; most of the audiobooks are drawn from Project Gutenberg. The training data is split into three partitions of 100 hr, 360 hr, and 500 hr, while the dev and test data are each split into 'clean' and 'other' categories depending on how challenging the audio is for automatic speech recognition systems. Each dev and test set is around 5 hr of audio. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.
2,351 PAPERS • 9 BENCHMARKS
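LibriSpeech distributes transcripts as plain-text `*.trans.txt` files, one per chapter, where each line pairs an utterance ID (`speaker-chapter-utterance`) with its uppercase transcript. A minimal parsing sketch (the helper names are ours, not part of the corpus tooling):

```python
def parse_trans_line(line: str) -> tuple[str, str]:
    """Split one transcript line into (utterance_id, text)."""
    utt_id, text = line.strip().split(" ", 1)
    return utt_id, text

def parse_trans_file(lines) -> dict[str, str]:
    """Map utterance IDs to transcripts for a whole chapter file."""
    return dict(parse_trans_line(l) for l in lines if l.strip())

# Lines in the corpus's format (from dev-clean):
sample = [
    "1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES",
    "1272-128104-0001 NOR IS MISTER QUILTER'S MANNER LESS INTERESTING",
]
transcripts = parse_trans_file(sample)
```

The utterance ID encodes the speaker and chapter, so the same scheme locates the matching FLAC file on disk.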
Common Voice is an audio dataset consisting of unique MP3 files, each paired with a corresponding text file. There are 9,283 recorded hours in the dataset, of which 7,335 hours are validated, across 60 languages. The dataset also includes demographic metadata such as age, sex, and accent.
448 PAPERS • 145 BENCHMARKS
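Common Voice releases ship per-clip metadata (including the demographic fields above) in tab-separated files such as `validated.tsv`. A reading sketch; the exact column names vary between releases, so treat the fields below as an assumption to check against the version you download:

```python
import csv
import io

def load_clips(tsv_text: str) -> list[dict]:
    """Return one dict per clip, keyed by the TSV header row."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Illustrative row in the release's TSV layout:
sample = (
    "client_id\tpath\tsentence\tage\tgender\taccents\n"
    "abc123\tcommon_voice_en_1.mp3\tHello world.\tthirties\tfemale\t\n"
)
clips = load_clips(sample)
```

Filtering on the demographic columns is then ordinary dict access, e.g. keeping only clips with a non-empty `age` field.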
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
321 PAPERS • 3 BENCHMARKS
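This corpus (LJ Speech) pairs clips with transcriptions via a single `metadata.csv` whose fields are pipe-separated: clip ID, raw transcription, and normalized transcription. A loading sketch; the sample row is illustrative:

```python
import csv
import io

def load_metadata(text: str) -> list[dict]:
    """Parse pipe-delimited metadata rows into dicts."""
    rows = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return [
        {"id": r[0], "text": r[1], "normalized": r[2]}
        for r in rows if r
    ]

sample = "LJ001-0001|Printing, in the only sense|Printing, in the only sense\n"
meta = load_metadata(sample)
```

The normalized column expands numbers and abbreviations, which is why TTS pipelines usually train on it rather than the raw text.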
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
139 PAPERS • 7 BENCHMARKS
VoxPopuli is a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages, totaling 5.1K hours.
113 PAPERS • 2 BENCHMARKS
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.
87 PAPERS • 4 BENCHMARKS
Multilingual LibriSpeech is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers eight languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It includes about 44.5K hours of English and a total of about 6K hours for the other languages.
77 PAPERS • 4 BENCHMARKS
The CALLHOME English Corpus is a collection of unscripted telephone conversations between native speakers of English.
11 PAPERS • 7 BENCHMARKS
Earnings-22 is a practical benchmark designed to evaluate automatic speech recognition (ASR) systems' performance on real-world, accented audio.
10 PAPERS • 1 BENCHMARK
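Benchmarks like Earnings-22 score systems by word error rate (WER): the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A minimal self-contained sketch (published results typically use toolkit implementations with text normalization, which is omitted here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.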
ITALIC: An ITALian Intent Classification Dataset
4 PAPERS • NO BENCHMARKS YET
Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To the best of our knowledge, MultiMed stands as the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes repr
3 PAPERS • NO BENCHMARKS YET
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
2 PAPERS • 5 BENCHMARKS
MSC is a dataset for macro-management in StarCraft II based on the SC2LE platform. It consists of well-designed feature vectors, pre-defined high-level actions, and the final result of each match. It contains 36,619 high-quality replays, which are unbroken and played by relatively professional players.
2 PAPERS • NO BENCHMARKS YET
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions in Norwegian Bokmål and Norwegian Nynorsk. All transcriptions are done manually by trained linguists or philologists, and the manual transcriptions are subsequently proofread to ensure consistency and accuracy. Entire days of parliamentary meetings are transcribed in the dataset.
2 PAPERS • 1 BENCHMARK
OpenSLR is a repository of open speech and language resources, including large-scale transcribed audio corpora and related software. It serves as a central platform for researchers and practitioners to access and share datasets used in speech recognition (ASR), text-to-speech (TTS), and linguistic research.
2 PAPERS • 2 BENCHMARKS
BERSt Dataset
1 PAPER • 2 BENCHMARKS
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
1 PAPER • NO BENCHMARKS YET
The MediBeng dataset contains synthetic code-switched dialogues in Bengali and English for training models in speech recognition (ASR), text-to-speech (TTS), and machine translation in clinical settings. The dataset is available under the CC-BY-4.0 license.
1 PAPER • 1 BENCHMARK
The Storytelling Video Dataset is a high-quality, human-reviewed multimodal dataset featuring over 700 full-body video recordings of native Russian speakers. Each video is 10+ minutes long and includes synchronized speech, facial expressions, gestures, and emotional variation, making it suited to multimodal research and development.
0 PAPERS • NO BENCHMARKS YET