SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

5 Apr 2021 · William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperform prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy, low-resource speech dataset, CHiME-6, and achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
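The mixing strategy described in the abstract can be illustrated with a minimal sketch. The abstract states only that the datasets are mixed without re-weighting or re-balancing; this sketch assumes that means pooling every utterance into one list and shuffling, so each corpus contributes in proportion to its size. The function names and the toy corpora are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def mix_datasets(datasets, seed=0):
    """Pool every utterance from every corpus and shuffle once.

    No per-dataset weights are applied, so a corpus's share of the
    mixture equals its share of the total number of utterances.
    """
    pooled = [(name, utt) for name, utts in datasets.items() for utt in utts]
    random.Random(seed).shuffle(pooled)
    return pooled


def batches(pooled, batch_size):
    """Yield consecutive mini-batches from the shuffled pool."""
    for start in range(0, len(pooled), batch_size):
        yield pooled[start:start + batch_size]


def mixture_shares(pooled):
    """Return the fraction of the pool contributed by each corpus."""
    counts = Counter(name for name, _ in pooled)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Toy stand-ins for real corpora (hypothetical sizes and contents).
toy_corpora = {
    "large_corpus": [f"large_utt_{i}" for i in range(8)],
    "small_corpus": [f"small_utt_{i}" for i in range(2)],
}
toy_pool = mix_datasets(toy_corpora)
toy_shares = mixture_shares(toy_pool)
```

Under this assumption, larger corpora dominate each batch simply because they hold more utterances; a re-balanced scheme would instead sample each corpus with a chosen probability.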

| Task | Dataset | Model | Metric | Value (%) | Global Rank |
|---|---|---|---|---|---|
| Speech Recognition | AMI IHM | SpeechStew (100M) | Word Error Rate (WER) | 9.0 | #2 |
| Speech Recognition | AMI SDM1 | SpeechStew (100M) | Word Error Rate (WER) | 21.7 | #2 |
| Speech Recognition | CHiME-6 dev_gss12 | SpeechStew (1B) | Word Error Rate (WER) | 31.9 | #3 |
| Speech Recognition | CHiME-6 eval | SpeechStew (1B) | Word Error Rate (WER) | 38.9 | #3 |
| Speech Recognition | Common Voice | SpeechStew (1B) | Test WER | 10.8 | #2 |
| Speech Recognition | LibriSpeech test-clean | SpeechStew (1B) | Word Error Rate (WER) | 1.7 | #6 |
| Speech Recognition | LibriSpeech test-clean | SpeechStew (100M) | Word Error Rate (WER) | 2.0 | #16 |
| Speech Recognition | LibriSpeech test-other | SpeechStew (1B) | Word Error Rate (WER) | 3.3 | #7 |
| Speech Recognition | LibriSpeech test-other | SpeechStew (100M) | Word Error Rate (WER) | 4.0 | #14 |
| Speech Recognition | Switchboard CallHome | SpeechStew (100M) | Word Error Rate (WER) | 8.3 | #1 |
| Speech Recognition | Switchboard SWBD | SpeechStew (100M) | Word Error Rate (WER) | 4.7 | #1 |
| Speech Recognition | Tedlium | SpeechStew (100M) | Word Error Rate (WER) | 5.3 | #2 |
| Speech Recognition | WSJ eval92 | SpeechStew (100M) | Word Error Rate (WER) | 1.3 | #1 |
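Every result above is reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it for a single reference/hypothesis pair using the standard Levenshtein dynamic program. It assumes plain whitespace tokenization and no text normalization; the scoring conventions used for each benchmark are not described on this page, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Corpus-level WER is usually computed by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.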
