SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

18 Apr 2019  ·  Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
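The two masking operations described in the abstract can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the function name, parameter names, and default mask widths are assumptions (loosely inspired by the paper's LibriSpeech policies), the mask fill value is simply zero, and the time-warping step is omitted.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Mask random blocks of frequency channels and time steps.

    spec: log-mel spectrogram of shape (time_steps, freq_channels).
    Note: this sketch omits the paper's time-warping step, and all
    parameter names/defaults are illustrative assumptions.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()          # do not modify the caller's array
    T, F = spec.shape
    # Frequency masking: zero out a random band of consecutive channels.
    for _ in range(num_freq_masks):
        f = rng.integers(0, freq_mask_width + 1)   # band width
        f0 = rng.integers(0, max(1, F - f + 1))    # band start
        spec[:, f0:f0 + f] = 0.0
    # Time masking: zero out a random block of consecutive time steps.
    for _ in range(num_time_masks):
        t = rng.integers(0, time_mask_width + 1)
        t0 = rng.integers(0, max(1, T - t + 1))
        spec[t0:t0 + t, :] = 0.0
    return spec
```

Because the masks are applied to the features rather than the raw waveform, the augmentation is cheap (no re-extraction of filter banks) and can be applied on the fly during training.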

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Speech Recognition | Hub5'00 SwitchBoard | LAS + SpecAugment (with LM, Switchboard mild policy) | CallHome | 14.6 | #2 |
| Speech Recognition | Hub5'00 SwitchBoard | LAS + SpecAugment (with LM, Switchboard mild policy) | SwitchBoard | 6.8 | #1 |
| Speech Recognition | Hub5'00 SwitchBoard | LAS + SpecAugment (with LM, Switchboard strong policy) | CallHome | 14 | #1 |
| Speech Recognition | Hub5'00 SwitchBoard | LAS + SpecAugment (with LM, Switchboard strong policy) | SwitchBoard | 7.1 | #2 |
| Speech Recognition | LibriSpeech test-clean | LAS + SpecAugment | Word Error Rate (WER) | 2.5 | #31 |
| Speech Recognition | LibriSpeech test-clean | LAS (no LM) | Word Error Rate (WER) | 2.7 | #34 |
| Speech Recognition | LibriSpeech test-other | LAS (no LM) | Word Error Rate (WER) | 6.5 | #33 |
| Speech Recognition | LibriSpeech test-other | LAS + SpecAugment | Word Error Rate (WER) | 5.8 | #30 |
