Search Results for author: Rohit Prabhavalkar

Found 56 papers, 7 papers with code

Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

no code implementations • 15 Apr 2024 • Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data.

Paper
Add Code

Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

no code implementations • 27 Feb 2024 • Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno

In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers

no code implementations • 18 Dec 2023 • Guru Prakash Arumugam, Shuo-Yiin Chang, Tara N. Sainath, Rohit Prabhavalkar, Quan Wang, Shaan Bijwadia

ASR models often suffer from a long-form deletion problem where the model predicts sequential blanks instead of words when transcribing a lengthy audio (in the order of minutes or hours).

speech-recognition Speech Recognition

Paper
Add Code

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

no code implementations • 13 Dec 2023 • Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning

no code implementations • 29 Sep 2023 • Lillian Zhou, Yuxin Ding, Mingqing Chen, Harry Zhang, Rohit Prabhavalkar, Dhruv Guliani, Giovanni Motta, Rajiv Mathews

Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm

no code implementations • 29 Sep 2023 • Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Pedro Moreno Mengibar

Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Massive End-to-end Models for Short Search Queries

no code implementations • 22 Sep 2023 • Weiran Wang, Rohit Prabhavalkar, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li, James Qin, Xingyu Cai, Adam Stooke, Zhong Meng, CJ Zheng, Yanzhang He, Tara Sainath, Pedro Moreno Mengibar

In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Improving Joint Speech-Text Representations Without Alignment

no code implementations • 11 Aug 2023 • Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly.

Speech Recognition

Paper
Add Code

How to Estimate Model Transferability of Pre-Trained Speech Models?

1 code implementation • 1 Jun 2023 • Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath

In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.

Paper
Code

A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

no code implementations • 19 Apr 2023 • Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath

Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.

Paper
Add Code

Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

no code implementations • 31 Mar 2023 • Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays

Overall, they present a modular, powerful and cheap alternative to the standard encoder output, as well as the N-best hypotheses.

speech-recognition Speech Recognition

Paper
Add Code

Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

no code implementations • 15 Mar 2023 • Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw

Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.

speech-recognition Speech Recognition

Paper
Add Code

End-to-End Speech Recognition: A Survey

no code implementations • 3 Mar 2023 • Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

no code implementations • 2 Mar 2023 • Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

no code implementations • 16 Feb 2023 • Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition.

Language Modelling speech-recognition +1

Paper
Add Code

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

no code implementations • 19 Jan 2023 • Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Dual Learning for Large Vocabulary On-Device ASR

no code implementations • 11 Jan 2023 • Cal Peyser, Ronny Huang, Tara Sainath, Rohit Prabhavalkar, Michael Picheny, Kyunghyun Cho

Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once.

Paper
Add Code

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

no code implementations • 28 Nov 2022 • W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model.

Vocal Bursts Valence Prediction

Paper
Add Code

Modular Hybrid Autoregressive Transducer

no code implementations • 31 Oct 2022 • Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder.

Language Modelling speech-recognition +1

Paper
Add Code

JOIST: A Joint Speech and Text Streaming Model For ASR

no code implementations • 13 Oct 2022 • Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties compared to previous works.

Paper
Add Code

Improving Deliberation by Text-Only and Semi-Supervised Training

no code implementations • 29 Jun 2022 • Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data.

Language Modelling

Paper
Add Code

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

no code implementations • 22 Apr 2022 • W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition.

Sentence speech-recognition +1

Paper
Add Code

Improving Rare Word Recognition with LM-aware MWER Training

no code implementations • 15 Apr 2022 • Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups.

Paper
Add Code

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

no code implementations • 13 Apr 2022 • Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Neural-FST Class Language Model for End-to-End Speech Recognition

no code implementations • 28 Jan 2022 • Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu, Bo wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework.

Language Modelling speech-recognition +1

Paper
Add Code

Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

no code implementations • 8 Oct 2021 • Zhiyun Lu, Yanwei Pan, Thibault Doutre, Parisa Haghani, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

Our experiments show that for both losses, the WER on long-form speech reduces substantially as the training utterance length increases.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

A Neural Acoustic Echo Canceller Optimized Using An Automatic Speech Recognizer And Large Scale Synthetic Data

no code implementations • 1 Jun 2021 • Nathan Howard, Alex Park, Turaj Zakizadeh Shabestary, Alexander Gruenstein, Rohit Prabhavalkar

We consider the problem of recognizing speech utterances spoken to a device which is generating a known sound waveform; for example, recognizing queries issued to a digital assistant which is generating responses to previous user inputs.

Acoustic echo cancellation Automatic Speech Recognition +3

Paper
Add Code

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

no code implementations • 6 Apr 2021 • Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

no code implementations • 5 Apr 2021 • Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

DET gets similar accuracy as a baseline model with better latency on a large in-house data set by assigning a lightweight encoder for the beginning part of one utterance and a full-size encoder for the rest.

speech-recognition Speech Recognition

Paper
Add Code

Learning Word-Level Confidence For Subword End-to-End ASR

no code implementations • 11 Mar 2021 • David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

no code implementations • 12 Dec 2020 • Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Cascaded encoders for unifying streaming and non-streaming ASR

no code implementations • 27 Oct 2020 • Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

The proposed model consists of streaming and non-streaming encoders.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

no code implementations • 20 Oct 2020 • Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos

We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

no code implementations • 7 May 2020 • Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

On a long-form YouTube test set, when the nonstreaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22. 3% to 14. 8%; when the streaming RNN-T model trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67. 0% to 25. 3%.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

no code implementations • 28 Mar 2020 • Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao

Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i. e., word error rate (WER), and latency, i. e., the time the hypothesis is finalized after the user stops speaking.

Sentence

Paper
Add Code

Deliberation Model Based Two-Pass End-to-End Speech Recognition

no code implementations • 17 Mar 2020 • Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

A comparison of end-to-end models for long-form speech recognition

no code implementations • 6 Nov 2019 • Chung-Cheng Chiu, Wei Han, Yu Zhang, Ruoming Pang, Sergey Kishchenko, Patrick Nguyen, Arun Narayanan, Hank Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Zhifeng Chen, Tara Sainath, Yonghui Wu

In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Recognizing long-form speech using streaming end-to-end models

no code implementations • 24 Oct 2019 • Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, David Rybach, Tara N. Sainath, Trevor Strohman

In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Two-Pass End-to-End Speech Recognition

1 code implementation • 29 Aug 2019 • Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

However, this model still lags behind a large state-of-the-art conventional model in quality [2].

speech-recognition Speech Recognition +1

Paper
Code

Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models

no code implementations • 21 Jun 2019 • Ke Hu, Antoine Bruguier, Tara N. Sainath, Rohit Prabhavalkar, Golan Pundak

Contextual automatic speech recognition, i. e., biasing recognition towards a given context (e. g. user's playlists, or contacts), is challenging in end-to-end (E2E) models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

2 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

2,779

Paper
Code

On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition

3 code implementations • 5 Feb 2019 • Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen

We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model.

Ranked #42 on Speech Recognition on LibriSpeech test-clean

Language Modelling Sequence-To-Sequence Speech Recognition +1

Paper
Code

Streaming End-to-end Speech Recognition For Mobile Devices

2 code implementations • 15 Nov 2018 • Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, Alexander Gruenstein

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition.

speech-recognition Speech Recognition

900

Paper
Code

From Audio to Semantics: Approaches to end-to-end spoken language understanding

no code implementations • 24 Sep 2018 • Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, Austin Waters

Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Deep context: end-to-end contextual speech recognition

no code implementations • 7 Aug 2018 • Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao

Our approach, which we re- fer to as Contextual Listen, Attend and Spell (CLAS) jointly- optimizes the ASR components along with embeddings of the context n-grams.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer

no code implementations • 2 Jan 2018 • Kanishka Rao, Haşim Sak, Rohit Prabhavalkar

The model consists of an `encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a `decoder' which is partially initialized from a recurrent neural network language model trained on text data alone.

Language Modelling speech-recognition +1

Paper
Add Code

An analysis of incorporating an external language model into a sequence-to-sequence model

no code implementations • 6 Dec 2017 • Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, Rohit Prabhavalkar

Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Improving the Performance of Online Neural Transducer Models

no code implementations • 5 Dec 2017 • Tara N. Sainath, Chung-Cheng Chiu, Rohit Prabhavalkar, Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Zhifeng Chen

Neural transducer is a streaming sequence-to-sequence model, but has shown a significant degradation in performance compared to non-streaming models such as Listen, Attend and Spell (LAS).

Paper
Add Code

No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

no code implementations • 5 Dec 2017 • Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu, Zhifeng Chen, Chung-Cheng Chiu

However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units.

Language Modelling

Paper
Add Code

Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

2 code implementations • 5 Dec 2017 • Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan

Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

4 code implementations • 5 Dec 2017 • Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

573

Paper
Code

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

no code implementations • 15 Nov 2017 • Chris Donahue, Bo Li, Rohit Prabhavalkar

We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models

no code implementations • 26 Oct 2017 • Yanzhang He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, Ian McGraw

We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components.

General Classification Language Modelling +1

Paper
Add Code

On the efficient representation and execution of deep acoustic models

no code implementations • 15 Jul 2016 • Raziel Alvarez, Rohit Prabhavalkar, Anton Bakhtin

In this paper we present a simple and computationally efficient quantization scheme that enables us to reduce the resolution of the parameters of a neural network from 32-bit floating point values to 8-bit integer values.

Quantization speech-recognition +1

Paper
Add Code

On the Compression of Recurrent Neural Networks with an Application to LVCSR acoustic modeling for Embedded Speech Recognition

no code implementations • 25 Mar 2016 • Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, Ian McGraw

We study the problem of compressing recurrent neural networks (RNNs).

Model Compression speech-recognition +1

Paper
Add Code

Personalized Speech recognition on mobile devices

no code implementations • 10 Mar 2016 • Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, Carolina Parada

We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone.

Language Modelling speech-recognition +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.