TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Speech Recognition	WenetSpeech	WeNet	Character Error Rate (CER)	8.88	# 5
Speech Recognition	WenetSpeech	ESPnet	Character Error Rate (CER)	9.7	# 7
Speech Recognition	WenetSpeech	Kaldi	Character Error Rate (CER)	9.07	# 6
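
The metric in the table above, Character Error Rate (CER), is conventionally defined as the character-level edit distance between the recognized text and the reference transcript, divided by the number of reference characters. The following is a minimal sketch under that standard definition, not the scoring script used for these benchmarks; the function names `edit_distance` and `cer` are illustrative, and any text normalization applied by the actual toolkits before scoring is not modeled here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference char missing in hyp)
                curr[j - 1] + 1,     # insertion (extra char in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return 100.0 * total_edits / total_chars
```

Summing edits and reference lengths over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score. Under this definition a CER of 8.88 corresponds to roughly 9 character errors per 100 reference characters.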

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/wenetspeech-a-10000-hours-multi-domain/speech-recognition-on-wenetspeech)](https://paperswithcode.com/sota/speech-recognition-on-wenetspeech?p=wenetspeech-a-10000-hours-multi-domain)`

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

7 Oct 2021 · BinBin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation during training; Test_Net, collected from the Internet for a matched test; and Test_Meeting, recorded from real meetings for a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Code

wenet-e2e/wenetspeech official

450

Tasks

Label Error Detection

Optical Character Recognition

Optical Character Recognition (OCR)

speech-recognition

Speech Recognition

Text Segmentation

Datasets

Introduced in the Paper:

WenetSpeech

Used in the Paper:

LibriSpeech

Results from the Paper

Ranked #5 on Speech Recognition on WenetSpeech

Methods

Test
