WenetSpeech

Introduced by Zhang et al. in WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. For the YouTube data, an optical character recognition (OCR) based method generates audio/text segmentation candidates from the corresponding video captions.

Benchmarks

Task	Dataset Variant	Best Model
Speech Recognition	WenetSpeech	Paraformer-large

Dataset Loaders

wenet-e2e/wenetspeech

450

Tasks

Speech Recognition

Similar Datasets

TAT

AISHELL-2

GigaSpeech

AISHELL-1

Source: https://github.com/wenet-e2e/wenetspeech.
