no code implementations • 23 Aug 2024 • Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, Xiaodan Zhuang
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained with either the CTC loss or an interpolation of the CTC and attention-based encoder-decoder (AED) losses.
Automatic Speech Recognition (ASR) +4
no code implementations • 14 Jun 2024 • Roger Hsiao, Liuhui Deng, Erik McDermott, Ruchir Travadi, Xiaodan Zhuang
Byte-level representations are often used by large-scale multilingual ASR systems when the character set of the supported languages is large.
Automatic Speech Recognition (ASR) +2
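The appeal of byte-level output units can be sketched in a few lines: any string, in any script, maps to a sequence drawn from only 256 byte values, so the output vocabulary stays fixed even when the character set is large. The example below is a minimal illustration of that idea, not the paper's actual representation scheme.

```python
# Byte-level text representation: UTF-8 bytes as output tokens.
# The vocabulary is at most 256 symbols regardless of the script.
text = "speech 音声"  # mixed Latin and CJK characters

byte_tokens = list(text.encode("utf-8"))  # each token is an int in [0, 255]
assert max(byte_tokens) < 256

# The byte sequence losslessly reconstructs the original string.
assert bytes(byte_tokens).decode("utf-8") == text
```

The trade-off, of course, is that multi-byte characters span several tokens, so a decoder must emit valid byte sequences to produce well-formed text.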
no code implementations • 24 May 2024 • Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly
Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make.
Automatic Speech Recognition (ASR) +2
no code implementations • 29 Nov 2022 • Stefan Braun, Erik McDermott, Roger Hsiao
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
Automatic Speech Recognition (ASR) +1
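To see why fitting a batch of 1024 forty-second utterances into 6 GB is notable, consider the memory a naive transducer loss would need: it materializes a full B × T × U × V logit tensor. The dimensions below (frame rate, target length, vocabulary size) are illustrative assumptions, not the paper's setup.

```python
# Back-of-the-envelope memory estimate for a naive RNN-T loss implementation.
# All dimensions except B are assumptions chosen for illustration.
B = 1024        # batch size (the paper's highlight)
T = 40 * 25     # 40 s of audio at an assumed 25 encoder frames/s -> 1000 frames
U = 100         # assumed target length in word pieces
V = 4096        # assumed word-piece vocabulary size
BYTES_FP32 = 4

# A naive implementation stores the full B x T x U x V logit tensor.
naive_gb = B * T * U * V * BYTES_FP32 / 1024**3
print(f"naive logit tensor: {naive_gb:.1f} GB")
# Under these assumptions the logit tensor alone needs ~1562.5 GB,
# orders of magnitude beyond the 6 GB reported for the full computation.
```

The gap illustrates why memory-optimized formulations that avoid materializing the full lattice of logits are essential at this scale.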
no code implementations • 2 Nov 2022 • Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang
This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios.
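The core mechanism — selecting a deployment scenario by swapping the attention mask at inference time while the weights stay fixed — can be sketched as below. The mode names and chunk size are illustrative assumptions, not the paper's exact configurations.

```python
import numpy as np

def attention_mask(seq_len: int, mode: str, chunk: int = 4) -> np.ndarray:
    """Boolean mask: mask[t, s] is True iff frame t may attend to frame s.

    One set of model weights, different masks per deployment scenario
    (illustrative sketch; modes and chunk size are assumptions).
    """
    t = np.arange(seq_len)
    if mode == "offline":
        # Full context: every frame attends to every frame.
        return np.ones((seq_len, seq_len), dtype=bool)
    if mode == "streaming":
        # Chunked attention: frame t sees everything up to the end of
        # its own chunk, bounding lookahead (and hence latency).
        chunk_end = (t // chunk + 1) * chunk
        return np.arange(seq_len)[None, :] < chunk_end[:, None]
    raise ValueError(f"unknown mode: {mode}")

full = attention_mask(6, "offline")
stream = attention_mask(6, "streaming", chunk=2)
```

Training with a mix of such masks is one way a single model can later run in either low-latency or full-context mode.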
no code implementations • 21 Oct 2022 • Thien Nguyen, Nathalie Tran, Liuhui Deng, Thiago Fraga da Silva, Matthew Radzihovsky, Roger Hsiao, Henry Mason, Stefan Braun, Erik McDermott, Dogan Can, Pawel Swietojanski, Lyan Verwimp, Sibel Oyman, Tresi Arvizo, Honza Silovsky, Arnab Ghoshal, Mathieu Martel, Bharat Ram Ambati, Mohamed Ali
Code-switching describes the practice of using more than one language in the same sentence.
Automatic Speech Recognition (ASR) +3
no code implementations • 26 Feb 2020 • Erik McDermott, Hasim Sak, Ehsan Variani
The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T.
Automatic Speech Recognition (ASR) +2
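A density-ratio style of LM fusion rescales a hypothesis score by adding a target-domain LM score and subtracting a source-domain (training-data) LM score. The sketch below shows only the scoring arithmetic as applied during beam search; the interpolation weights are assumptions, and details such as length normalization are omitted.

```python
def density_ratio_score(log_p_asr: float,
                        log_p_target_lm: float,
                        log_p_source_lm: float,
                        lam_target: float = 0.5,
                        lam_source: float = 0.5) -> float:
    """Density-ratio hypothesis score (sketch; weights are assumptions).

    Adds the target-domain LM and subtracts the source-domain LM,
    approximately cancelling the implicit LM learned from training data.
    """
    return log_p_asr + lam_target * log_p_target_lm - lam_source * log_p_source_lm

# Two hypotheses with identical acoustic scores: the one the target-domain
# LM prefers (relative to the source-domain LM) is ranked higher.
a = density_ratio_score(-10.0, log_p_target_lm=-3.0, log_p_source_lm=-6.0)
b = density_ratio_score(-10.0, log_p_target_lm=-6.0, log_p_source_lm=-3.0)
assert a > b
```

This is what makes the cross-domain scenario work: plentiful target-domain text trains the external LM, while the subtraction compensates for the RNN-T's bias toward its own training-data distribution.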
5 code implementations • 7 Feb 2020 • Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy.
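Limiting the left context amounts to a banded causal attention mask: frame t may attend only to frames in [t − L, t], so per-step attention cost is bounded regardless of utterance length. A minimal sketch, with the window size as an assumption:

```python
import numpy as np

def limited_context_mask(seq_len: int, left_context: int) -> np.ndarray:
    """Boolean mask: mask[t, s] is True iff frame t may attend to frame s.

    Causal attention restricted to the last `left_context` frames plus the
    current frame (illustrative sketch; the window size is an assumption).
    """
    t = np.arange(seq_len)[:, None]
    s = np.arange(seq_len)[None, :]
    return (s <= t) & (s >= t - left_context)

mask = limited_context_mask(seq_len=6, left_context=2)
# Each row has at most left_context + 1 True entries, so the attention
# cost per decoding step stays constant as the utterance grows.
```

This bounded window is what makes streaming decoding computationally tractable, at the cost of the slight accuracy degradation the abstract reports.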