Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

19 May 2020  ·  Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that by using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training, excluding all GMM bootstrapping, decision-tree building, and forced-alignment steps while still achieving very competitive word error rates. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency, since we can use a larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which, like English, is a fusional language, and Turkish, which is an agglutinative language.
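CTC training over wordpiece units avoids the forced-alignment step because the per-frame labels are collapsed at decode time: repeated symbols are merged and blank symbols are dropped. The sketch below illustrates that collapse rule in plain Python; the wordpiece IDs, the blank index, and the frame sequence are hypothetical examples, not values from the paper.

```python
# Greedy CTC decoding sketch: merge consecutive repeats, then drop blanks.
# BLANK = 0 follows the common convention; the wordpiece inventory below
# (1 = "_hel", 2 = "lo") is a made-up example for illustration.

BLANK = 0

def ctc_collapse(frame_ids):
    """Collapse a per-frame argmax sequence into output units."""
    out = []
    prev = None
    for token in frame_ids:
        # Emit a unit only when it differs from the previous frame
        # and is not the blank symbol.
        if token != prev and token != BLANK:
            out.append(token)
        prev = token
    return out

# Nine frames of hypothetical network output at some stride:
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_collapse(frames))  # [1, 2]  -> "_hel" + "lo" = "hello"
```

Because the collapse rule is invariant to how many frames each unit spans, a coarser stride (fewer frames per second) yields the same output sequence as long as each emitted unit still covers at least one frame, which is why larger strides trade little accuracy for faster inference with wordpiece targets.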


Datasets

LibriSpeech; internal VideoASR (German, Turkish)
Results from the Paper


Ranked #17 on Speech Recognition on LibriSpeech test-other (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Speech Recognition | LibriSpeech test-clean | CTC + Transformer LM rescoring | Word Error Rate (WER) | 2.10 | #21 | Yes |
| Speech Recognition | LibriSpeech test-other | CTC + Transformer LM rescoring | Word Error Rate (WER) | 4.20 | #17 | Yes |

Methods


No methods listed for this paper.