Common Voice is an audio dataset in which each entry consists of a unique MP3 and a corresponding text file. The dataset contains 9,283 recorded hours, including 7,335 validated hours in 60 languages. It also includes demographic metadata such as age, sex, and accent.
314 PAPERS • 164 BENCHMARKS
AISHELL-1 is a corpus for speech recognition research and for building speech recognition systems for Mandarin.
163 PAPERS • 1 BENCHMARK
AISHELL-2 contains 1,000 hours of clean read-speech data from iOS and is free for academic usage.
49 PAPERS • 4 BENCHMARKS
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, with 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is used to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions.
38 PAPERS • 1 BENCHMARK
The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in the corpus are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcriptions and precise speaker voice activity timestamps are manually labeled for each sample. Detailed speaker information is also provided.
9 PAPERS • NO BENCHMARKS YET
CSRC is a collection of data for Children Speech Recognition. The data for this challenge is divided into 3 datasets, referred to as A (Adult speech training set), C1 (Children speech training set), and C2 (Children conversation training set). All datasets combined amount to 400 hours of Mandarin speech data.
4 PAPERS • NO BENCHMARKS YET