Deep Speech: Scaling up end-to-end speech recognition

17 Dec 2014 · Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments...

Evaluation results from the paper


| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Noisy Speech Recognition | CHiME clean | CNN + Bi-RNN + CTC (speech to letters) | Percentage error | 6.3 | #2 |
| Noisy Speech Recognition | CHiME real | CNN + Bi-RNN + CTC (speech to letters) | Percentage error | 67.94 | #4 |
| Speech Recognition | swb_hub_500 WER fullSWBCH | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 16 | #6 |
| Speech Recognition | Switchboard + Hub500 | Deep Speech | Percentage error | 20 | #21 |
| Speech Recognition | Switchboard + Hub500 | Deep Speech + FSH | Percentage error | 12.6 | #15 |
| Speech Recognition | Switchboard + Hub500 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 12.6 | #15 |
| Accented Speech Recognition | VoxForge American-Canadian | Deep Speech | Percentage error | 15.01 | #2 |
| Accented Speech Recognition | VoxForge Commonwealth | Deep Speech | Percentage error | 28.46 | #2 |
| Accented Speech Recognition | VoxForge European | Deep Speech | Percentage error | 31.20 | #2 |
| Accented Speech Recognition | VoxForge Indian | Deep Speech | Percentage error | 45.35 | #2 |