Search Results for author: Yanzhang He

Found 26 papers, 6 papers with code

Improving Rare Word Recognition with LM-aware MWER Training

no code implementations15 Apr 2022 Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setup.
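The shallow fusion setup mentioned here can be illustrated with a toy beam-search scorer: the E2E hypothesis score is interpolated with an external LM score. The log-probabilities and the 0.3 LM weight below are made-up illustrative values, not the paper's.

```python
def shallow_fusion_score(asr_log_probs, lm_log_probs, lm_weight=0.3):
    # Hypothesis score = sum of per-token ASR log-probs
    # plus a weighted sum of external-LM log-probs (shallow fusion).
    return sum(asr_log_probs) + lm_weight * sum(lm_log_probs)

# Two toy hypotheses with per-token log-probabilities:
hyp_a = shallow_fusion_score([-0.2, -0.9], [-1.0, -0.5])  # common phrasing
hyp_b = shallow_fusion_score([-0.3, -1.0], [-0.2, -0.2])  # rare word the LM likes
# Without the LM (lm_weight=0) hyp_a wins; with fusion hyp_b overtakes it.
```

Rescoring applies the same interpolation, but only to a completed n-best list rather than during the beam search itself.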

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

no code implementations8 Apr 2022 Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw

Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers.

Action Detection Activity Detection +1

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

no code implementations29 Mar 2022 Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Reducing the latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios.

Automatic Speech Recognition Quantization
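Quantization-aware training simulates the low-bit kernels during the float forward pass via quantize-dequantize ("fake quantization"). A minimal sketch of symmetric 4-bit fake quantization; the per-tensor scale scheme and the weight values are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def fake_quant_int4(w):
    # Symmetric int4: 16 levels, clipped to [-8, 7].
    scale = np.max(np.abs(w)) / 7.0           # largest weight maps to level 7
    q = np.clip(np.round(w / scale), -8, 7)   # integers an int4 kernel would store
    return q * scale                          # dequantize for the float forward pass

w = np.array([0.7, -0.33, 0.06, -0.7])
w_q = fake_quant_int4(w)   # every value snapped to a multiple of the scale
```

Training against these snapped weights lets the model adapt to the quantization error it will see at inference time.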

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

no code implementations24 Feb 2022 Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker.

Speaker Verification Speech Recognition

Cross-attention conformer for context modeling in speech enhancement for ASR

no code implementations30 Oct 2021 Arun Narayanan, Chung-Cheng Chiu, Tom O'Malley, Quan Wang, Yanzhang He

This work introduces \emph{cross-attention conformer}, an attention-based architecture for context modeling in speech enhancement.

Automatic Speech Recognition Speech Enhancement

Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition

no code implementations7 Oct 2021 Qiujia Li, Yu Zhang, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland

As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems.

Automatic Speech Recognition

Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning

no code implementations1 Oct 2021 Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim, Trevor Strohman, Françoise Beaufays, Yanzhang He

Self- and semi-supervised learning methods have been actively investigated to reduce the need for labeled training data or to enhance model performance.

Domain Adaptation

Tied & Reduced RNN-T Decoder

no code implementations15 Sep 2021 Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He

Previous work on Recurrent Neural Network-Transducer (RNN-T) models has shown that, under some conditions, it is possible to simplify the prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]).

Language Modelling
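The reduced-decoder idea — replacing the recurrent prediction network with an embedding lookup over only the last few labels, sharing one table with the output layer — can be sketched as follows. The dimensions, the mean-pooling, and the start-token fallback are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, CONTEXT = 100, 8, 2

# A single table stands in for both the prediction-network embedding and
# the (transposed) output projection -- the weight-tying part of the idea.
embed = rng.normal(size=(VOCAB, DIM))

def prediction_state(label_history):
    # Stateless prediction network: embed only the last CONTEXT labels
    # and pool them, instead of running an LSTM over the whole history.
    recent = label_history[-CONTEXT:] or [0]   # fall back to a start token
    return embed[recent].mean(axis=0)

h = prediction_state([5, 17, 42, 9])   # only labels 42 and 9 influence h
```

Because the state depends on a fixed-size label window, the decoder needs no recurrent state and its parameter count collapses to the shared embedding table.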

Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

no code implementations2 Jul 2021 Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

In this paper, we propose a solution to allow speaker conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass.

Automatic Speech Recognition Text-Independent Speaker Verification
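A minimal sketch of the attentive speaker embedding: per input frame, attend over the N enrolled d-vectors and condition the model on the weighted mixture. The dot-product scoring below is an assumed stand-in for the paper's learned attention.

```python
import numpy as np

def attentive_embedding(frame_feature, enrolled_dvecs):
    # Softmax attention over N enrolled speaker embeddings, so a single
    # model pass can serve an arbitrary number of enrolled users.
    scores = enrolled_dvecs @ frame_feature    # similarity of frame to each speaker
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ enrolled_dvecs            # one conditioning vector

dvecs = np.array([[1.0, 0.0], [0.0, 1.0]])    # two enrolled users (toy 2-D d-vectors)
mix = attentive_embedding(np.array([3.0, 0.0]), dvecs)  # leans toward speaker 0
```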

Personalized Keyphrase Detection using Speaker and Environment Information

no code implementations28 Apr 2021 Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw

In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary.

Automatic Speech Recognition Speaker Separation +1

Learning Word-Level Confidence For Subword End-to-End ASR

no code implementations11 Mar 2021 David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).

Automatic Speech Recognition Model Selection
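Subword confidences must be collapsed to word level before downstream use. The paper learns the aggregation; a hand-picked `min` over SentencePiece-style pieces shows the shape of the problem (the "▁" word-start convention is an assumption of this sketch).

```python
def word_confidences(pieces):
    # pieces: list of (subword, confidence); a leading "▁" marks a word start.
    words, confs = [], []
    for piece, conf in pieces:
        if piece.startswith("▁") or not words:
            words.append(piece.lstrip("▁"))   # start a new word
            confs.append(conf)
        else:
            words[-1] += piece                # continue the current word
            confs[-1] = min(confs[-1], conf)  # simple stand-in aggregator
    return list(zip(words, confs))

out = word_confidences([("▁speech", 0.9), ("▁recog", 0.8), ("nition", 0.6)])
# → [("speech", 0.9), ("recognition", 0.6)]
```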

Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

no code implementations12 Dec 2020 Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR).

Automatic Speech Recognition

A Better and Faster End-to-End Model for Streaming ASR

no code implementations21 Nov 2020 Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR.

Audio and Speech Processing Sound

Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

1 code implementation22 Oct 2020 Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman

For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions.

Automatic Speech Recognition

FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

1 code implementation21 Oct 2020 Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th-percentile latency from 210 ms to only 30 ms on LibriSpeech.

Automatic Speech Recognition Word Alignment
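The 90th-percentile figures quoted above summarize a distribution of per-utterance emission delays; a nearest-rank percentile over hypothetical latencies shows how such a number is read off (the latency values below are made up).

```python
import math

def percentile(values, p):
    # Nearest-rank method: smallest value with at least p% of the data at or below it.
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

latencies_ms = [30, 40, 25, 35, 30, 45, 30, 20, 50, 30]  # made-up per-utterance delays
p90 = percentile(latencies_ms, 90)
```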

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

1 code implementation9 Sep 2020 Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein

We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system.

Speech Recognition
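VoiceFilter-Lite predicts a soft mask over the noisy features, conditioned on the target speaker's embedding, and multiplies the mask in to suppress non-target speech. In this sketch, `toy_model` is a hypothetical stand-in for the learned mask network.

```python
import numpy as np

def voicefilter_lite_step(noisy_frame, speaker_embedding, mask_model):
    # One streaming step: predict a per-dimension soft mask from the noisy
    # frame plus the target speaker's d-vector, then apply it.
    mask = mask_model(np.concatenate([noisy_frame, speaker_embedding]))
    return mask * noisy_frame

# Hypothetical stand-in for the learned mask network:
toy_model = lambda x: np.clip(x[:4], 0.0, 1.0)
frame = np.array([0.5, 1.2, 0.1, 0.9])
dvec = np.zeros(3)
enhanced = voicefilter_lite_step(frame, dvec, toy_model)
```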

Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

no code implementations30 Aug 2020 Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

The 2nd-pass model plays a key role in improving the quality of the end-to-end model so that it surpasses the conventional model.

Speech Recognition

Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer

no code implementations2 Jun 2020 Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Francoise Beaufays

The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate.

Automatic Speech Recognition

Towards Fast and Accurate Streaming End-to-End ASR

no code implementations24 Apr 2020 Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu

RNN-T EP+LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160 ms reduction in 90th-percentile latency compared to the originally proposed RNN-T EP model.

Audio and Speech Processing

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

no code implementations28 Mar 2020 Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao

Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking.
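The quality metric named here, WER, is the word-level Levenshtein edit distance divided by the number of reference words; a small self-contained scorer:

```python
def wer(ref, hyp):
    # Word error rate: (substitutions + insertions + deletions) / reference words.
    r, h = ref.split(), hyp.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

score = wer("the cat sat", "the cat sat down")  # 1 insertion over 3 reference words
```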

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

3 code implementations21 Feb 2019 Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models

no code implementations26 Oct 2017 Yanzhang He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, Ian McGraw

We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components.

General Classification Language Modelling +1
