Search Results for author: Tara N. Sainath

Found 79 papers, 6 papers with code

Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

no code implementations • 27 Feb 2024 • Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno

In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation

no code implementations • 20 Feb 2024 • Wen Wu, Bo Li, Chao Zhang, Chung-Cheng Chiu, Qiujia Li, Junwen Bai, Tara N. Sainath, Philip C. Woodland

The evidential uncertainty measure is extended to quantify the uncertainty in emotion distribution estimation.

Classification Emotion Classification

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

no code implementations • 23 Jan 2024 • W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck.

Language Modelling Large Language Model +2

Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

no code implementations • 17 Jan 2024 • Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones.

Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers

no code implementations • 18 Dec 2023 • Guru Prakash Arumugam, Shuo-Yiin Chang, Tara N. Sainath, Rohit Prabhavalkar, Quan Wang, Shaan Bijwadia

ASR models often suffer from a long-form deletion problem where the model predicts sequential blanks instead of words when transcribing lengthy audio (on the order of minutes or hours).

speech-recognition Speech Recognition

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

no code implementations • 13 Dec 2023 • Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

no code implementations • 14 Aug 2023 • Shaan Bijwadia, Shuo-Yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Improving Joint Speech-Text Representations Without Alignment

no code implementations • 11 Aug 2023 • Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly.

Speech Recognition

How to Estimate Model Transferability of Pre-Trained Speech Models?

1 code implementation • 1 Jun 2023 • Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath

In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.

Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

no code implementations • 28 May 2023 • W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-Yiin Chang, Tara N. Sainath

We address this limitation by distilling punctuation knowledge from a bidirectional teacher language model (LM) trained on written, punctuated text.

Decoder Language Modelling +2

Mixture-of-Expert Conformer for Streaming Multilingual ASR

no code implementations • 25 May 2023 • Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline.

Automatic Speech Recognition speech-recognition +1

Modular Domain Adaptation for Conformer-Based Streaming ASR

no code implementations • 22 May 2023 • Qiujia Li, Bo Li, Dongseong Hwang, Tara N. Sainath, Pedro M. Mengibar

Speech data from different domains has distinct acoustic and linguistic characteristics.

Domain Adaptation speech-recognition +1

Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

no code implementations • 31 Mar 2023 • Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu

Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers.

Decoder

Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

no code implementations • 31 Mar 2023 • Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays

Overall, they present a modular, powerful and cheap alternative to the standard encoder output, as well as the N-best hypotheses.

Decoder speech-recognition +1

A Deliberation-based Joint Acoustic and Text Decoder

no code implementations • 23 Mar 2023 • Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data.

Decoder speech-recognition +1

Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

no code implementations • 15 Mar 2023 • Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw

Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.

speech-recognition Speech Recognition

End-to-End Speech Recognition: A Survey

no code implementations • 3 Mar 2023 • Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

UML: A Universal Monolingual Output Layer for Multilingual ASR

no code implementations • 22 Feb 2023 • Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-Yiin Chang

Consequently, the UML makes it possible to switch the interpretation of each output node depending on the language of the input speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Massively Multilingual Shallow Fusion with Large Language Models

no code implementations • 17 Feb 2023 • Ke Hu, Tara N. Sainath, Bo Li, Nan Du, Yanping Huang, Andrew M. Dai, Yu Zhang, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman

In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

no code implementations • 16 Feb 2023 • Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition.

Language Modelling speech-recognition +1

Efficient Domain Adaptation for Speech Foundation Models

no code implementations • 3 Feb 2023 • Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, Francoise Beaufays

The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data.

Decoder Domain Adaptation +3

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

no code implementations • 19 Jan 2023 • Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

no code implementations • 28 Nov 2022 • W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model.

Decoder Vocal Bursts Valence Prediction

Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

no code implementations • 4 Nov 2022 • Zhouyuan Huo, Khe Chai Sim, Bo Li, Dongseong Hwang, Tara N. Sainath, Trevor Strohman

Experimental results show that the proposed method can achieve better performance on the speech recognition task than existing algorithms, with fewer trainable parameters, lower computational memory cost, and faster training speed.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

no code implementations • 2 Nov 2022 • Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios.

Spoken Command Recognition

JOIST: A Joint Speech and Text Streaming Model For ASR

no code implementations • 13 Oct 2022 • Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties compared to previous works.

Scaling Up Deliberation for Multilingual ASR

no code implementations • 11 Oct 2022 • Ke Hu, Bo Li, Tara N. Sainath

In this work, we investigate second-pass deliberation for multilingual speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

A Language Agnostic Multilingual Streaming On-Device ASR System

no code implementations • 29 Aug 2022 • Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Turn-Taking Prediction for Natural Conversational Speech

no code implementations • 29 Aug 2022 • Shuo-Yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

This makes doing speech recognition with conversational speech, including one with multiple queries, a challenging task.

speech-recognition Speech Recognition

Streaming Intended Query Detection using E2E Modeling for Continued Conversation

no code implementations • 29 Aug 2022 • Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Qiao Liang, Tara N. Sainath, Bo Li, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, speaking queries followed by a hotword each time introduces a cognitive burden in continued conversations.

Improving Deliberation by Text-Only and Semi-Supervised Training

no code implementations • 29 Jun 2022 • Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data.

Decoder Language Modelling

Self-Supervised Speech Representation Learning: A Review

no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

no code implementations • 22 Apr 2022 • W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition.

Sentence speech-recognition +1

Improving Rare Word Recognition with LM-aware MWER Training

no code implementations • 15 Apr 2022 • Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups.

Streaming Align-Refine for Non-autoregressive Deliberation

no code implementations • 15 Apr 2022 • Weiran Wang, Ke Hu, Tara N. Sainath

We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.

Decoder

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

no code implementations • 13 Apr 2022 • Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

no code implementations • 9 Mar 2022 • W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection.

Language Modelling Sentence +2

Improving the fusion of acoustic and text representations in RNN-T

no code implementations • 25 Jan 2022 • Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-Yiin Chang

The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Joint Unsupervised and Supervised Training for Multilingual ASR

no code implementations • 15 Nov 2021 • Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath

Our average WER of all languages outperforms average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%.

Language Modelling Masked Language Modeling +3

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

no code implementations • 27 Sep 2021 • Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio.

Ranked #1 on Speech Recognition on Common Voice

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Tied & Reduced RNN-T Decoder

no code implementations • 15 Sep 2021 • Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He

Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have shown that, under some conditions, it is possible to simplify its prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]).

Decoder Language Modelling

Scaling End-to-End Models for Large-Scale Multilingual ASR

no code implementations • 30 Apr 2021 • Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data.

Multi-Task Learning

Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

no code implementations • 9 Apr 2021 • W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table.

Language Modelling Sentence +2

Learning Word-Level Confidence For Subword End-to-End ASR

no code implementations • 11 Mar 2021 • David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Transformer Based Deliberation for Two-Pass Speech Recognition

no code implementations • 27 Jan 2021 • Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman

In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring.

Decoder speech-recognition +2

Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

no code implementations • 12 Dec 2020 • Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A Better and Faster End-to-End Model for Streaming ASR

no code implementations • 21 Nov 2020 • Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR.

Audio and Speech Processing Sound

Multitask Training with Text Data for End-to-End Speech Recognition

no code implementations • 27 Oct 2020 • Peidong Wang, Tara N. Sainath, Ron J. Weiss

We propose a multitask training method for attention-based end-to-end speech recognition models.

Decoder Language Modelling +2

Cascaded encoders for unifying streaming and non-streaming ASR

no code implementations • 27 Oct 2020 • Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

The proposed model consists of streaming and non-streaming encoders.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

1 code implementation • 21 Oct 2020 • Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile latency from 210 ms to only 30 ms on LibriSpeech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

no code implementations • ICLR 2021 • Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

no code implementations • 24 Aug 2020 • Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion

no code implementations • 19 May 2020 • Cal Peyser, Tara N. Sainath, Golan Pundak

Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems in that a particular name may appear only rarely during training, and may have a pronunciation similar to that of a more common word.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

no code implementations • 7 May 2020 • Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

On a long-form YouTube test set, when the nonstreaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model is trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Towards Fast and Accurate Streaming End-to-End ASR

no code implementations • 24 Apr 2020 • Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu

RNN-T EP+LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160 ms reduction in 90th-percentile latency compared to the originally proposed RNN-T EP model.

Audio and Speech Processing

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

no code implementations • 28 Mar 2020 • Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao

Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking.

Sentence

Deliberation Model Based Two-Pass End-to-End Speech Recognition

no code implementations • 17 Mar 2020 • Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Recognizing long-form speech using streaming end-to-end models

no code implementations • 24 Oct 2019 • Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, David Rybach, Tara N. Sainath, Trevor Strohman

In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

no code implementations • 11 Sep 2019 • Anjuli Kannan, Arindrima Datta, Tara N. Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee

Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Two-Pass End-to-End Speech Recognition

1 code implementation • 29 Aug 2019 • Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

However, this model still lags behind a large state-of-the-art conventional model in quality [2].

speech-recognition Speech Recognition +1

Improving Performance of End-to-End ASR on Numeric Sequences

no code implementations • 1 Jul 2019 • Cal Peyser, Hao Zhang, Tara N. Sainath, Zelin Wu

This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g.

speech-recognition Speech Recognition

Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models

no code implementations • 21 Jun 2019 • Ke Hu, Antoine Bruguier, Tara N. Sainath, Rohit Prabhavalkar, Golan Pundak

Contextual automatic speech recognition, i.e., biasing recognition towards a given context (e.g., a user's playlists or contacts), is challenging in end-to-end (E2E) models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A spelling correction model for end-to-end speech recognition

no code implementations • 19 Feb 2019 • Jinxi Guo, Tara N. Sainath, Ron J. Weiss

Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs.

Language Modelling speech-recognition +2

Streaming End-to-end Speech Recognition For Mobile Devices

2 code implementations • 15 Nov 2018 • Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, Alexander Gruenstein

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition.

speech-recognition Speech Recognition

Contextual Speech Recognition with Difficult Negative Training Examples

no code implementations • 29 Oct 2018 • Uri Alon, Golan Pundak, Tara N. Sainath

Improving the representation of contextual information is key to unlocking the potential of end-to-end (E2E) automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Deep context: end-to-end contextual speech recognition

no code implementations • 7 Aug 2018 • Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao

Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

no code implementations • 27 Jul 2018 • Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N. Sainath, Karen Livescu

In this paper, we compare a suite of past methods and some of our own proposed methods for using unpaired text data to improve encoder-decoder models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

An analysis of incorporating an external language model into a sequence-to-sequence model

no code implementations • 6 Dec 2017 • Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, Rohit Prabhavalkar

Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

no code implementations • 5 Dec 2017 • Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu, Zhifeng Chen, Chung-Cheng Chiu

However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units.

Language Modelling

Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

2 code implementations • 5 Dec 2017 • Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan

Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

4 code implementations • 5 Dec 2017 • Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Improving the Performance of Online Neural Transducer Models

no code implementations • 5 Dec 2017 • Tara N. Sainath, Chung-Cheng Chiu, Rohit Prabhavalkar, Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Zhifeng Chen

Neural transducer is a streaming sequence-to-sequence model, but has shown a significant degradation in performance compared to non-streaming models such as Listen, Attend and Spell (LAS).

Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model

no code implementations • 5 Dec 2017 • Bo Li, Tara N. Sainath, Khe Chai Sim, Michiel Bacchiani, Eugene Weinstein, Patrick Nguyen, Zhifeng Chen, Yonghui Wu, Kanishka Rao

Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network.

speech-recognition Speech Recognition

Multilingual Speech Recognition With A Single End-To-End Model

no code implementations • 6 Nov 2017 • Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Learning Compact Recurrent Neural Networks

no code implementations • 9 Apr 2016 • Zhiyun Lu, Vikas Sindhwani, Tara N. Sainath

Recurrent neural networks (RNNs), including long short-term memory (LSTM) RNNs, have produced state-of-the-art results on a variety of speech recognition tasks.

speech-recognition Speech Recognition

Structured Transforms for Small-Footprint Deep Learning

no code implementations • NeurIPS 2015 • Vikas Sindhwani, Tara N. Sainath, Sanjiv Kumar

We consider the task of building compact deep learning pipelines suitable for deployment on storage and power constrained mobile devices.

Keyword Spotting speech-recognition +1

Improvements to deep convolutional neural networks for LVCSR

no code implementations • 5 Sep 2013 • Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.

Speech Recognition

Accelerating Hessian-free optimization for deep neural networks by implicit preconditioning and sampling

no code implementations • 5 Sep 2013 • Tara N. Sainath, Lior Horesh, Brian Kingsbury, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian.

Deep Neural Network Language Models

no code implementations • WS 2012 • Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran

Language Modelling Machine Translation +2
