Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data.
On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10%, against a neural rescoring baseline.
We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems.
Automatic speech recognition (ASR) models with low-footprint are increasingly being deployed on edge devices for conversational agents, which enhances privacy.
We also show that the proposed distillation can reduce the WER gap between the student and the teacher by 62% upto 100%.
In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR.
Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation.
no code implementations • 12 May 2023 • Suhaila M. Shakiah, Rupak Vignesh Swaminathan, Hieu Duy Nguyen, Raviteja Chinta, Tariq Afzal, Nathan Susanj, Athanasios Mouchtaris, Grant P. Strimel, Ariya Rastrow
Machine learning model weights and activations are represented in full-precision during training.
Streaming speech recognition architectures are employed for low-latency, real-time applications.
no code implementations • 3 Apr 2023 • Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree M. Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann
We present dual-attention neural biasing, an architecture designed to boost Wake Words (WW) recognition and improve inference time latency on speech recognition tasks.
End-to-End (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulties recognizing infrequent words personalized to the user, such as names and places.
Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains a challenging task, primarily due to reduced data availability (necessitating increased data collection), and rapidly shifting data distributions (requiring more frequent model fine-tuning).
no code implementations • 19 Jul 2022 • Gopinath Chennupati, Milind Rao, Gurpreet Chadha, Aaron Eakin, Anirudh Raju, Gautam Tiwari, Anit Kumar Sahu, Ariya Rastrow, Jasha Droppo, Andy Oberlin, Buddha Nandanoor, Prahalad Venkataramanan, Zheng Wu, Pankaj Sitpure
For end-to-end automatic speech recognition (ASR) tasks, the absence of human annotated labels along with the need for privacy preserving policies for model building makes it a daunting challenge.
We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization.
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
Specifically, training a bidirectional model like BERT on a discriminative objective such as minimum WER (MWER) has not been explored.
no code implementations • 13 Dec 2021 • Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel
Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio.
We also leverage both BLSTM and pretrained BERT based models to encode contextual data and guide the network training.
In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio, obviating the need for transcription.
Ranked #13 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)
We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks.
We apply AmNets to the Recurrent Neural Network Transducer (RNN-T) to reduce compute cost and latency for an automatic speech recognition (ASR) task.
Ranked #51 on Speech Recognition on LibriSpeech test-clean
As more speech processing applications execute locally on edge devices, a set of resource constraints must be considered.
Spoken language understanding (SLU) systems extract both text transcripts and semantics associated with intents and slots from input speech utterances.
An ASR model that operates on both primary and auxiliary data can achieve better accuracy compared to a primary-only solution; and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable.
When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7. 0% relative over a standard LM that does not incorporate contextual information.
On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3. 33% relative WERR.
However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2. 0 network from the quantized representations in a way similar to a VQ-VAE model.
no code implementations • 15 Feb 2021 • Aditya Gourav, Linda Liu, Ankur Gandhe, Yile Gu, Guitang Lan, Xiangyang Huang, Shashank Kalmane, Gautam Tiwari, Denis Filimonov, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko
We also describe a novel second-pass de-biasing approach: used in conjunction with a first-pass shallow fusion that optimizes on oracle WER, we can achieve an additional 14% improvement on personalized content recognition, and even improve accuracy for the general use case by up to 2. 5%.
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech, and are essential components of voice activated systems.
In the literature, such methods are referred to as sparse pruning.
As voice assistants become more ubiquitous, they are increasingly expected to support and perform well on a wide variety of use-cases across different domains.
Accents mismatching is a critical problem for end-to-end ASR.
We show that this simple method can improve performance on rare words by 3. 7% WER relative without degradation on general test set, and the improvement from USF is additive to any additional language model based rescoring.
Finally, we contrast these methods to a jointly trained end-to-end joint SLU model, consisting of ASR and NLU subsystems which are connected by a neural network based interface instead of text, that produces transcripts as well as NLU interpretation.
Decomposing models into multiple components is critically important in many applications such as language modeling (LM) as it enables adapting individual components separately and biasing of some components to the user's personal preferences.
no code implementations • 8 Jul 2020 • Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann
Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies.
Grapheme-to-phoneme (G2P) models are a key component in Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, as they are used to generate pronunciations for out-of-vocabulary words that do not exist in the pronunciation lexicons (mappings like "e c h o" to "E k oU").
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel and rely on a language identification (LID) component that detects the input language.
In this work, we propose to combine the benefits of end-to-end approaches with a conventional system using an attention-based discriminative language model that learns to rescore the output of a first-pass ASR system.
Neural language models (NLM) have been shown to outperform conventional n-gram language models by a substantial margin in Automatic Speech Recognition (ASR) and other tasks.
New application intents and interaction types are released for these systems over time, imposing challenges to adapt the LMs since the existing training data is no longer sufficient to model the future user interactions.
We prove that, with enough data, the LSTM model is indeed as capable of learning whisper characteristics from LFBE features alone compared to a simpler MLP model that uses both LFBE and features engineered for separating whisper and normal speech.
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants.
Statistical language models (LM) play a key role in Automatic Speech Recognition (ASR) systems used by conversational agents.
no code implementations • 1 Nov 2017 • Anjishnu Kumar, Arpit Gupta, Julian Chan, Sam Tucker, Bjorn Hoffmeister, Markus Dreyer, Stanislav Peshterliev, Ankur Gandhe, Denis Filiminov, Ariya Rastrow, Christian Monson, Agnika Kumar
This paper presents the design of the machine learning architecture that underlies the Alexa Skills Kit (ASK) a large scale Spoken Language Understanding (SLU) Software Development Kit (SDK) that enables developers to extend the capabilities of Amazon's virtual assistant, Alexa.