This dataset contains named-entity annotations for European Parliament recordings in Dutch, French, German, and Spanish. The entity annotation scheme follows OntoNotes v5. The original unannotated dataset is VoxPopuli.
1 PAPER • NO BENCHMARKS YET
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech, and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms, and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
1 PAPER • 2 BENCHMARKS
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
1 PAPER • 5 BENCHMARKS
NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.
2 PAPERS • NO BENCHMARKS YET
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
4 PAPERS • NO BENCHMARKS YET
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. It comprises eight English speech recognition datasets, capturing a wide range of domains, acoustic conditions, speaker styles, and transcription requirements.
Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. It contains 200 hours of speech data from 600 speakers, and the transcription accuracy for each sentence is greater than 98%.
0 PAPERS • NO BENCHMARKS YET
A Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who voiced pre-assigned emotions in dialogues of ~3 minutes each. Each sample contains the name of the part from the original studio source, a speech file of a human voice (16,000 or 44,100 Hz), one of 7 emotion labels, and a speech-to-text transcript of the utterance.
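As a rough illustration of the per-sample structure described above, the record below is a minimal sketch; the field names and example values are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout mirroring the description above;
# the released dataset's actual field names may differ.
@dataclass
class EmotionalSpeechSample:
    source_part: str    # name of the part from the original studio source
    audio_path: str     # path to the speech file (16,000 or 44,100 Hz)
    sample_rate: int    # 16000 or 44100
    emotion: str        # one of the 7 emotion labels
    transcript: str     # speech-to-text transcript of the utterance

sample = EmotionalSpeechSample(
    source_part="dialog_03_part_1",   # hypothetical identifier
    audio_path="audio/dialog_03_part_1.wav",
    sample_rate=16000,
    emotion="joy",
    transcript="...",
)
```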
This noisy speech test set is created from the Google Speech Commands v2 [1] and the Musan dataset [2].
2 PAPERS • 1 BENCHMARK
The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones at a sampling rate of 16 kHz. The dialogs are classified into 15 diverse domains and tagged with topic labels, ranging from science and technology to everyday life. Accurate transcriptions and precise speaker voice activity timestamps are manually labeled for each sample, and detailed speaker information is also provided.
9 PAPERS • NO BENCHMARKS YET
Data collection was conducted by asking adults recruited through social media and students from an elementary school to participate in our experiment. Table 1 shows the number of samples gathered for recognizing each color. Because two words are used for black in Persian, there are more black samples. In addition, because color recognition is a RAN (rapid automatized naming) task, sequences of data were gathered; Table 2 shows the number of sequence samples for the colors. For the meaningless words, an average of 12 recordings were gathered for each word (there are 40 meaningless words in this task).
ADIMA is a novel, linguistically diverse, ethically sourced, expert-annotated, and well-balanced multilingual profanity detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.
The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.
Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. The first phase of the SLUE benchmark suite consists of named entity recognition (NER), sentiment analysis (SA), and ASR on the corresponding datasets.
20 PAPERS • 3 BENCHMARKS
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from its corresponding video captions.
39 PAPERS • 1 BENCHMARK
Europarl-ASR (EN) is a 1300-hour English-language speech and text corpus of parliamentary debates for (streaming) Automatic Speech Recognition training and benchmarking, speech data filtering, and speech data verbatimization, based on European Parliament speeches and their official transcripts (1996-2020). It includes dev-test sets for streaming ASR benchmarking, made up of 18 hours of manually revised speeches. The availability of manual non-verbatim and verbatim transcripts for the dev-test speeches also makes this corpus useful for assessing automatic filtering and verbatimization techniques. The corpus is released under an open license at https://www.mllp.upv.es/europarl-asr/
5 PAPERS • 2 BENCHMARKS
The OLR 2021 dataset contains the data for the Oriental Language Recognition (OLR) 2021 Challenge, which intends to improve the performance of language recognition systems and speech recognition systems within multilingual scenarios.
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR)-motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
15 PAPERS • 4 BENCHMARKS
CrowdSpeech is a publicly available large-scale dataset of crowdsourced audio transcriptions. It contains annotations for more than 20 hours of English speech from more than 1,000 crowd workers.
Golos is a Russian speech dataset suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform. The total duration of the audio is about 1240 hours.
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.
56 PAPERS • 3 BENCHMARKS
This Sanskrit speech corpus has more than 78 hours of audio data and contains recordings of 45,953 sentences at a sampling rate of 22 kHz. The content mainly consists of readings of texts spanning various Śāstras of Saṃskṛtam literature, and also includes contemporary stories, radio programs, extempore discourse, etc.
SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours of professionally transcribed financial audio. Unlike previous transcription datasets, SPGISpeech contains global English accents, strongly varying audio quality, and both spontaneous and presentation-style speech. Each transcript has been cross-checked by multiple professional editors for high accuracy and is fully formatted, including sentence structure and capitalization.
13 PAPERS • 1 BENCHMARK
Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. The dataset has four intents, corresponding to four common offline voice assistant uses: SetTimer, SetAlarm, SimpleMath, and UnitConversion. The semantic label for each utterance is a dictionary with the intent and a number of slots.
4 PAPERS • 1 BENCHMARK
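As an illustration of the label format Timers and Such describes, the dictionaries below are a minimal sketch; the slot names are hypothetical, since the actual slot inventory is defined by the dataset's annotation scheme.

```python
# Hypothetical intent + slots dictionaries in the style described above;
# the real slot names come from the Timers and Such annotations.
set_timer_label = {
    "intent": "SetTimer",
    "slots": {"minutes": 5},
}

unit_conversion_label = {
    "intent": "UnitConversion",
    "slots": {"amount": 3, "from_unit": "miles", "to_unit": "kilometers"},
}
```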
MediaSpeech is a media speech dataset (as you might have guessed) built to test the performance of Automatic Speech Recognition (ASR) systems. The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing. It contains 10 hours of speech for each language provided. This release contains audio datasets in French, Arabic, Turkish, and Spanish, and is part of a larger private dataset.
CSRC is a collection of data for Children Speech Recognition. The data for this challenge is divided into 3 datasets, referred to as A (adult speech training set), C1 (children speech training set), and C2 (children conversation training set). All datasets combined amount to 400 hours of Mandarin speech data.
CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest second version covers translations from 21 languages into English and from English into 15 languages. It has a total of 2,880 hours of speech and is diversified with 78K speakers and 66 accents.
33 PAPERS • NO BENCHMARKS YET
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions (each of the 9 languages paired with each of the other 8). This corpus has been compiled using the debates held in the European Parliament between 2008 and 2012.
55 PAPERS • NO BENCHMARKS YET
EmoSpeech contains keywords with diverse emotions and background sounds, presented to explore new challenges in audio analysis.
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset.
The VOICES corpus is a dataset to promote speech and signal processing research of speech recorded by far-field microphones in noisy room conditions.
44 PAPERS • NO BENCHMARKS YET
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
347 PAPERS • 4 BENCHMARKS
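As a minimal sketch of one common way to obtain Speech Commands (here via torchaudio's built-in loader; the root path is arbitrary):

```python
import torchaudio

# Download Speech Commands v2 and load its training subset.
dataset = torchaudio.datasets.SPEECHCOMMANDS(
    root="./data", download=True, subset="training"
)

# Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate)
```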
AV Digits Database is an audiovisual database which contains normal, whispered, and silent speech. 53 participants were recorded from 3 different views (frontal, 45°, and profile) pronouncing digits and phrases in three speech modes.
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test)
Number of microphones: 3 (Microsoft Kinect, Yamaha, Samson)
Number of wave files per microphone: about 14,500
Total number of participants: 180 (130 male / 50 female)
11 PAPERS • 1 BENCHMARK
The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are part of the LibriVox project. Most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100 hr, 360 hr, and 500 hr, while the dev and test data are each split into ’clean’ and ’other’ categories, depending on how well or how poorly Automatic Speech Recognition systems perform on them. Each of the dev and test sets is around 5 hr in audio length. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.
1,972 PAPERS • 8 BENCHMARKS
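As a minimal sketch of how the LibriSpeech splits above are typically consumed (here via torchaudio's built-in loader; any of the split names can be substituted for the url argument):

```python
import torchaudio

# Download and load the 100-hour clean training partition.
# Other valid split names: "train-clean-360", "train-other-500",
# "dev-clean", "dev-other", "test-clean", "test-other".
train_set = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_set[0]
print(sample_rate, transcript)
```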
The REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge is a benchmark for the evaluation of automatic speech recognition techniques. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channel, 2-channel, or 8-channel microphone arrays in reverberant meeting rooms. It features both real recordings and simulated data.
50 PAPERS • 1 BENCHMARK
2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English telephone conversations used in the 2000 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).
31 PAPERS • 1 BENCHMARK
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
164 PAPERS • 1 BENCHMARK
AISHELL-2 contains 1,000 hours of clean read-speech data recorded on iOS devices and is free for academic usage.
50 PAPERS • 4 BENCHMARKS
The Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR) is a dataset for automatic speech recognition for Sorani Kurdish.
The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning.
37 PAPERS • NO BENCHMARKS YET
ClovaCall is a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario, collected from more than 11,000 people. The raw dataset includes approximately 112,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.
FT Speech is a speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data.
Libri-Adapt aims to support unsupervised domain adaptation research on speech recognition models.
Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
64 PAPERS • 2 BENCHMARKS
LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.
3 PAPERS • NO BENCHMARKS YET
MASRI-HEADSET is a corpus that was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender.
MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) is an extension of the CMU Wilderness Multilingual Speech Dataset, a speech dataset based on recorded readings of the New Testament.
15 PAPERS • 3 BENCHMARKS