Deep Speech: Scaling up end-to-end speech recognition

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Datasets


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Noisy Speech Recognition | CHiME clean | CNN + Bi-RNN + CTC (speech to letters) | Percentage error | 6.3 | # 2 |
| Noisy Speech Recognition | CHiME real | CNN + Bi-RNN + CTC (speech to letters) | Percentage error | 67.94 | # 5 |
| Speech Recognition | swb_hub_500 WER fullSWBCH | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 16 | # 8 |
| Speech Recognition | Switchboard + Hub500 | Deep Speech | Percentage error | 20 | # 30 |
| Speech Recognition | Switchboard + Hub500 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 12.6 | # 18 |
| Speech Recognition | Switchboard + Hub500 | Deep Speech + FSH | Percentage error | 12.6 | # 18 |
| Accented Speech Recognition | VoxForge American-Canadian | Deep Speech | Percentage error | 15.01 | # 2 |
| Accented Speech Recognition | VoxForge Commonwealth | Deep Speech | Percentage error | 28.46 | # 2 |
| Accented Speech Recognition | VoxForge European | Deep Speech | Percentage error | 31.20 | # 2 |
| Accented Speech Recognition | VoxForge Indian | Deep Speech | Percentage error | 45.35 | # 2 |
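
The "Percentage error" values above are presumably word error rates (WER), as the "25.9% WER" note in the model name suggests. As a minimal sketch of how such a figure is commonly computed, the following calculates WER as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used to produce the results listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using Levenshtein distance over words.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization of the kind official scoring tools apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.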
