SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

5 Apr 2021  ·  William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperform prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
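The mixing strategy described in the abstract (pool every corpus, apply no per-dataset weights) can be illustrated with a minimal sketch. This is not the authors' data pipeline: the `Utterance` record layout, the `mix_datasets` helper, and the toy corpora below are all hypothetical and exist only to show what mixing without re-weighting or re-balancing means in practice.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical record layout: (audio_path, transcript).
Utterance = Tuple[str, str]


def mix_datasets(datasets: Dict[str, List[Utterance]], seed: int = 0) -> List[Utterance]:
    """Pool all corpora into one training list and shuffle it.

    No per-corpus weights are applied, so each corpus contributes
    in proportion to its number of utterances.
    """
    pooled = [utt for corpus in datasets.values() for utt in corpus]
    random.Random(seed).shuffle(pooled)  # seeded for reproducibility
    return pooled


# Toy stand-ins for real corpora (illustrative file names only).
toy = {
    "librispeech": [("ls_0.wav", "hello world"), ("ls_1.wav", "good morning")],
    "ami": [("ami_0.wav", "let us start the meeting")],
}
mixed = mix_datasets(toy)
```

Under this scheme a larger corpus is simply seen more often during training, because sampling is uniform over utterances rather than over datasets.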


Results from the Paper


Task                Dataset                 Model              Metric Name            Metric Value  Global Rank
Speech Recognition  AMI IHM                 SpeechStew (100M)  Word Error Rate (WER)  9.0           # 1
Speech Recognition  AMI SDM1                SpeechStew (100M)  Word Error Rate (WER)  21.7          # 1
Speech Recognition  CHiME-6 dev_gss12       SpeechStew (1B)    Word Error Rate (WER)  31.9          # 1
Speech Recognition  CHiME-6 eval            SpeechStew (1B)    Word Error Rate (WER)  38.9          # 1
Speech Recognition  Common Voice            SpeechStew (1B)    Test WER               10.8%         # 1
Speech Recognition  LibriSpeech test-clean  SpeechStew (100M)  Word Error Rate (WER)  2.0           # 12
Speech Recognition  LibriSpeech test-clean  SpeechStew (1B)    Word Error Rate (WER)  1.7           # 4
Speech Recognition  LibriSpeech test-other  SpeechStew (100M)  Word Error Rate (WER)  4.0           # 11
Speech Recognition  LibriSpeech test-other  SpeechStew (1B)    Word Error Rate (WER)  3.3           # 5
Speech Recognition  Switchboard CallHome    SpeechStew (100M)  Word Error Rate (WER)  8.3           # 1
Speech Recognition  Switchboard SWBD        SpeechStew (100M)  Word Error Rate (WER)  4.7           # 1
Speech Recognition  Tedlium                 SpeechStew (100M)  Word Error Rate (WER)  5.3           # 1
Speech Recognition  WSJ eval92              SpeechStew (100M)  Word Error Rate (WER)  1.3           # 1
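
Every result above is a word error rate: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below shows that standard definition for a single utterance using word-level edit distance. The function name `word_error_rate` is illustrative, and the whitespace tokenization is an assumption; the text normalization and scoring tools behind the reported numbers are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Benchmark scores are usually computed at the corpus level, summing errors and reference words over all utterances before dividing, rather than averaging per-utterance rates.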
