State-of-the-art automatic speech recognition (ASR) systems are trained with tens of thousands of hours of labeled speech data.
Averaged across all languages, our WER improves on the average monolingual baseline by 33.3% and on the state-of-the-art 2-stage XLSR by 32%.
Fast contextual adaptation has been shown to be effective in improving ASR of rare words, and when combined with on-device personalized training, it can yield even better recognition results.
These models are typically trained on the server using transcribed speech data.
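As a rough illustration of contextual biasing, the sketch below applies a shallow-fusion-style score bonus to user-specific tokens during greedy decoding. Everything in it (the function name, the token IDs, the bonus value) is a hypothetical toy, not the adaptation mechanism of the work above.

```python
# Toy contextual-biasing sketch (hypothetical; not the papers' method).
# Tokens from a user-specific context list (e.g. contact-name pieces)
# receive an additive score bonus at each decoding step.
import numpy as np

def biased_greedy_decode(log_probs, bias_token_ids, bonus=2.0):
    """log_probs: (T, V) array of per-step token log-probabilities.
    bias_token_ids: vocabulary ids to boost (e.g. rare-name pieces).
    Returns the greedy token sequence with biasing applied."""
    hyp = []
    for step_scores in log_probs:
        scores = step_scores.copy()
        for tok in bias_token_ids:
            scores[tok] += bonus  # shallow-fusion-style additive boost
        hyp.append(int(np.argmax(scores)))
    return hyp

# Toy usage: 3 decoding steps over a 5-token vocabulary.
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(5), size=3))
print(biased_greedy_decode(log_probs, bias_token_ids={2}))
```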
Self- and semi-supervised learning methods have been actively investigated to reduce the amount of labeled training data required or to enhance model performance.
no code implementations • 27 Sep 2021 • Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio.
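For a concrete, if toy, picture of the semi-supervised side, the sketch below runs one round of pseudo-labeling: train a teacher on labeled data, keep only its confident predictions on unlabeled data, and retrain on the union. A nearest-centroid classifier stands in for the ASR model purely so the loop is runnable; the work summarized above uses far richer recipes such as noisy student training.

```python
# One round of pseudo-labeling (toy stand-in model, hypothetical data).
import numpy as np

def fit_centroids(X, y):
    """Train: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Return (label, confidence); confidence is a softmax over
    negative distances to the class centroids."""
    labels = list(centroids)
    d = np.array([np.linalg.norm(x - centroids[c]) for c in labels])
    p = np.exp(-d) / np.exp(-d).sum()
    i = int(p.argmax())
    return labels[i], float(p[i])

rng = np.random.default_rng(0)
X_lab = rng.normal([[0, 0]] * 10 + [[4, 4]] * 10, 0.5)   # small labeled set
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = rng.normal([[0, 0]] * 50 + [[4, 4]] * 50, 0.5)   # larger unlabeled set

teacher = fit_centroids(X_lab, y_lab)
# Keep only confident pseudo-labels, then retrain on the enlarged set.
pl = [(x, *predict(teacher, x)) for x in X_unl]
keep = [(x, y) for x, y, conf in pl if conf >= 0.9]
X_aug = np.vstack([X_lab] + [x[None] for x, _ in keep])
y_aug = np.concatenate([y_lab, [y for _, y in keep]])
student = fit_centroids(X_aug, y_aug)
```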
While current state-of-the-art ASR systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns.
Training machine learning models on mobile devices has the potential to improve both the privacy and the accuracy of the models.
no code implementations • 14 Dec 2019 • Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, Giovanni Motta, Lillian Zhou
With speech input, if the user corrects only the names, the name recall rate improves to 64.4%.
Speaker-independent speech recognition systems trained with data from many users are generally robust against speaker variability and work well for a large population of speakers.
2 code implementations • 15 Nov 2018 • Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, Alexander Gruenstein
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition.
More importantly, such models generalize better to unseen conditions and allow for rapid adaptation: we show that with as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch on 70 times as much data.
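To make the E2E idea above concrete, here is a minimal sketch of a character-level recognizer: an LSTM encoder maps log-mel frames straight to per-frame character posteriors, decoded with CTC-style greedy collapsing. The model name, sizes, and vocabulary are illustrative assumptions; on-device systems like those described here use more capable streaming architectures such as RNN-T.

```python
# Toy end-to-end character recognizer (illustrative assumptions throughout).
import torch
import torch.nn as nn

class TinyE2E(nn.Module):
    def __init__(self, n_mels=80, hidden=128, n_chars=29):  # 28 chars + blank
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_chars)

    def forward(self, feats):              # feats: (batch, frames, n_mels)
        enc, _ = self.encoder(feats)
        return self.proj(enc).log_softmax(dim=-1)

def ctc_greedy(log_probs, blank=0):
    """Collapse repeats and strip blanks from the per-frame argmax path."""
    path = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank
    for t in path:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

model = TinyE2E()
feats = torch.randn(1, 200, 80)            # ~2 s of 80-dim log-mel frames
print(ctc_greedy(model(feats)[0]))         # decoded character ids
```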
We demonstrate this method's usefulness in revealing information divergence in the bases of recurrent factorized kernels, visualizing the character-level differences between the memory of n-gram and recurrent language models, and extracting knowledge of history encoded in the layers of grapheme-based end-to-end ASR networks.
Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding the separate components of a typical system, namely the acoustic model (AM), pronunciation model (PM), and language model (LM), into a single neural network.
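A minimal sketch of that folding, under assumed shapes and sizes: the encoder plays the acoustic-model role, while the autoregressive attention decoder absorbs the pronunciation and language models. This is an LAS-flavored toy, not the exact model from the paper.

```python
# Toy attention-based sequence-to-sequence recognizer (hypothetical sizes).
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, n_mels=80, hidden=128, n_tokens=30):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True)  # "AM" role
        self.embed = nn.Embedding(n_tokens, hidden)
        self.decoder = nn.LSTMCell(2 * hidden, hidden)            # "PM"+"LM" role
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                    # (B, T, H)
        h = c = feats.new_zeros(feats.size(0), enc.size(-1))
        logits = []
        for y_prev in self.embed(targets).unbind(dim=1):
            # Dot-product attention over encoder frames.
            att = torch.softmax((enc @ h.unsqueeze(-1)).squeeze(-1), dim=1)
            ctx = (att.unsqueeze(-1) * enc).sum(dim=1)  # (B, H) context vector
            h, c = self.decoder(torch.cat([y_prev, ctx], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)               # (B, U, n_tokens)

model = TinySeq2Seq()
feats = torch.randn(2, 100, 80)             # batch of log-mel feature frames
targets = torch.randint(0, 30, (2, 12))     # previous-token decoder inputs
print(model(feats, targets).shape)          # torch.Size([2, 12, 30])
```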