Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

8 Dec 2015

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages...

Evaluation results from the paper


 SOTA for Speech Recognition on WSJ eval93 (using extra training data)

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Noisy Speech Recognition | CHiME clean | Deep Speech 2 | Percentage error | 3.34 | #1 |
| Noisy Speech Recognition | CHiME real | Deep Speech 2 | Percentage error | 21.79 | #3 |
| Speech Recognition | LibriSpeech test-other | Deep Speech 2 | Word Error Rate (WER) | 13.25 | #7 |
| Accented Speech Recognition | VoxForge American-Canadian | Deep Speech 2 | Percentage error | 7.55 | #1 |
| Accented Speech Recognition | VoxForge Commonwealth | Deep Speech 2 | Percentage error | 13.56 | #1 |
| Accented Speech Recognition | VoxForge European | Deep Speech 2 | Percentage error | 17.55 | #1 |
| Accented Speech Recognition | VoxForge Indian | Deep Speech 2 | Percentage error | 22.44 | #1 |
| Speech Recognition | WSJ eval92 | Deep Speech 2 | Percentage error | 3.60 | #4 |
| Speech Recognition | WSJ eval93 | Deep Speech 2 | Percentage error | 4.98 | #1 |
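
The LibriSpeech result is reported as Word Error Rate (WER). As a rough illustration of how such a figure is usually computed, the sketch below takes the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis and divides it by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; this is not the paper's scoring pipeline, and real evaluations typically apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c d")` counts one substitution over four reference words. Because insertions are counted against the reference length, WER can exceed 100 when the hypothesis is much longer than the reference.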