With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition.
From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets.
Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications.
Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life.
This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution.
We apply noisy training to improve both dense and sparse state-of-the-art Emformer models and observe consistent WER reduction.
On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets.
no code implementations • 6 Apr 2021 • Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer
As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task.
To achieve more flexible and better trade-offs between accuracy and latency, the following techniques are used.
We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space.
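The SemDist definition above can be illustrated with a minimal sketch. The excerpt does not say which embedding model or which distance is used, so both choices here are assumptions: cosine distance stands in for the distance, and `make_toy_embedder` is a hypothetical bag-of-words embedder that stands in for a real sentence encoder. The names `sem_dist`, `cosine_distance`, and `make_toy_embedder` are illustrative only.

```python
import math
from collections import Counter


def make_toy_embedder(vocab):
    """Return a hypothetical bag-of-words embedder over a fixed vocabulary.

    This is only a stand-in for a real sentence-level embedding model.
    """
    def embed(sentence):
        counts = Counter(sentence.lower().split())
        return [float(counts[word]) for word in vocab]
    return embed


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity); an assumed choice of distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # No usable direction for a zero vector; treat as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def sem_dist(reference, hypothesis, embed):
    """Distance between a reference/hypothesis pair in the embedding space."""
    return cosine_distance(embed(reference), embed(hypothesis))


if __name__ == "__main__":
    embed = make_toy_embedder(["play", "music", "stop", "now"])
    print(sem_dist("play music", "play music now", embed))
```

With a real sentence encoder in place of the toy embedder, two transcripts that differ in wording but not in meaning would be expected to score closer together than a word-level comparison suggests.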
no code implementations • 5 Apr 2021 • Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
DET achieves accuracy similar to a baseline model, with better latency, on a large in-house data set by assigning a lightweight encoder to the beginning of an utterance and a full-size encoder to the rest.
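The routing idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the encoders are treated as per-frame callables (real encoders carry context across frames), and `dynamic_encode` and `switch_frame` are hypothetical names introduced for this example.

```python
def dynamic_encode(frames, light_encoder, full_encoder, switch_frame):
    """Encode the first `switch_frame` frames with a lightweight encoder
    and the remaining frames with a full-size encoder.

    Assumption: each encoder maps one frame to one output independently.
    """
    head = [light_encoder(frame) for frame in frames[:switch_frame]]
    tail = [full_encoder(frame) for frame in frames[switch_frame:]]
    return head + tail


if __name__ == "__main__":
    # Placeholder encoders that only tag which path handled each frame.
    light = lambda frame: ("light", frame)
    full = lambda frame: ("full", frame)
    print(dynamic_encode([0, 1, 2, 3, 4], light, full, switch_frame=2))
```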
no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.
Our policy adapts the augmentation parameters based on the training loss of the data samples.
In this paper, we tackle the problem of handling narrowband and wideband speech by building a single acoustic model (AM), also called a mixed-bandwidth AM.