Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.
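As a minimal illustration of why the hypotheses form a tree, here is a generic beam-search sketch; `log_probs` is a hypothetical stand-in for the model's next-token scorer, not any particular system's API.

```python
# Minimal beam-search sketch: hypotheses form a tree because every kept
# prefix branches into several extensions at each step. `log_probs` is a
# hypothetical stand-in for the model's next-token distribution.
def beam_search(log_probs, vocab, beam_width=4, max_len=10, eos="</s>"):
    beams = [([], 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished branch
                continue
            for tok, lp in log_probs(prefix, vocab):
                candidates.append((prefix + [tok], score + lp))
        # Keep only the best `beam_width` branches of the hypothesis tree.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams
```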
We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition.
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
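A minimal sketch of one common aggressive scheme, post-training symmetric int8 weight quantization; the per-tensor scaling shown here is illustrative, not the paper's exact recipe.

```python
# Sketch of post-training symmetric int8 weight quantization: weights are
# mapped to 8-bit integers with a single per-tensor scale (illustrative).
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```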
Large language models (LLMs) such as GPT-2, BERT and RoBERTa have been successfully applied to ASR N-best rescoring.
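A minimal sketch of N-best rescoring under the usual log-linear interpolation assumption; `lm_score` is a hypothetical placeholder for, e.g., a GPT-2 sentence log-probability.

```python
# Hypothetical N-best rescoring sketch: the final score interpolates the
# first-pass ASR score with an external LM log-likelihood.
def rescore_nbest(nbest, lm_score, lam=0.3):
    # nbest: list of (hypothesis_text, asr_log_score)
    rescored = [(hyp, (1 - lam) * s + lam * lm_score(hyp)) for hyp, s in nbest]
    return max(rescored, key=lambda x: x[1])[0]
```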
We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
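A minimal sketch of length perturbation, assuming it amounts to randomly deleting and duplicating acoustic frames; the rates are illustrative, not the paper's tuned values.

```python
# Sketch of length perturbation: randomly delete and duplicate acoustic
# frames so the model sees utterances of varied length (rates illustrative).
import numpy as np

def length_perturb(frames, drop_rate=0.05, dup_rate=0.05, rng=np.random):
    out = []
    for f in frames:                      # frames: (T, feat_dim) array
        if rng.random() < drop_rate:
            continue                      # shorten: skip this frame
        out.append(f)
        if rng.random() < dup_rate:
            out.append(f)                 # lengthen: repeat this frame
    return np.stack(out) if out else frames
```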
In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using text-only resources.
We observe 20-45% relative word error rate (WER) reduction in these settings with this novel LM style customization technique using only unpaired text data from the new domains.
The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts.
Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.
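A toy single-process simulation of one decentralized step on a ring, assuming the fixed alternating communication pattern; the randomized and delay-by-one variants change which neighbor is contacted and when.

```python
# Toy simulation of decentralized SGD on a ring: at every step each worker
# averages its weights with one ring neighbor, then takes a local SGD step.
def ring_adpsgd_step(weights, grads, lr=0.1, t=0):
    n = len(weights)
    new = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        partner = (i + 1) % n if t % 2 == 0 else (i - 1) % n
        w_avg = 0.5 * (w + weights[partner])   # mix with ring neighbor
        new.append(w_avg - lr * g)             # then take a local SGD step
    return new
```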
Automatic speech recognition (ASR) is a capability which enables a program to process human speech into a written form.
no code implementations • 27 Aug 2021 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts).
By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNN-T ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently.
Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
Ranked #1 on Speech Recognition on Switchboard + Hub500
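A minimal sketch of the general probability-ratio (density-ratio) idea, assuming the standard log-linear form: add a target-domain LM score and subtract a source-domain LM score that approximates the decoder's implicit LM; the weights are illustrative.

```python
# Sketch of probability-ratio style external LM integration: the
# source-domain LM score stands in for the decoder's implicit LM and is
# subtracted out, while the target-domain LM is added in (weights illustrative).
def ratio_score(asr_logp, target_lm_logp, source_lm_logp,
                lam_t=0.6, lam_s=0.4):
    return asr_logp + lam_t * target_lm_logp - lam_s * source_lm_logp
```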
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU).
The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.
The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning.
no code implementations • 4 Feb 2020 • Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Decentralized Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.
It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training.
Ranked #2 on Speech Recognition on swb_hub_500 WER fullSWBCH
This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech.
On the commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3x as large as the one used in SSGD, thus enabling training at a much larger scale.
no code implementations • 30 Apr 2019 • Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko
With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition.
We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set.
This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple.
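A minimal sketch of greedy CTC decoding over a word vocabulary, which shows why no separate decoder is needed: an argmax per frame, collapsing repeats and dropping blanks.

```python
# Sketch of why A2W/CTC word models need no decoder: greedy CTC decoding is
# just an argmax per frame, collapsing repeats and dropping blanks.
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    # logits: (T, vocab) frame-level scores over the word vocabulary
    best = logits.argmax(axis=1)
    out, prev = [], blank
    for t in best:
        if t != blank and t != prev:
            out.append(int(t))   # word ids, mapped to words by a lookup table
        prev = t
    return out
```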
An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling.
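A minimal sketch, assuming the common recipe of appending a fixed per-speaker embedding (e.g., an i-vector) to every acoustic frame; dimensions are illustrative.

```python
# Sketch of embedding-based speaker adaptive training: a fixed per-speaker
# embedding is appended to every acoustic frame before the network sees it.
import numpy as np

def add_speaker_embedding(frames, spk_embedding):
    # frames: (T, feat_dim); spk_embedding: (emb_dim,)
    emb = np.tile(spk_embedding, (frames.shape[0], 1))
    return np.concatenate([frames, emb], axis=1)   # (T, feat_dim + emb_dim)
```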
Language models (LMs) based on Long Short-Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks.
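A minimal PyTorch sketch of an LSTM LM of this kind; the sizes are illustrative.

```python
# Minimal LSTM language model: embed tokens, run an LSTM, project to
# next-token logits (all sizes illustrative).
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)              # (B, T, emb_dim)
        out, state = self.lstm(x, state)    # (B, T, hidden)
        return self.proj(out), state        # logits over next token
```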
Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM.
no code implementations • 6 Mar 2017 • George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall
This then raises two issues: what IS human performance, and how far down can we still drive speech recognition error rates?
Ranked #3 on Speech Recognition on Switchboard + Hub500
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set.
Ranked #5 on Speech Recognition on swb_hub_500 WER fullSWBCH
We describe the latest improvements to the IBM English conversational telephone speech recognition system.
Ranked #11 on Speech Recognition on Switchboard + Hub500
We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.
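A worked example of the relative-WER arithmetic, assuming a hypothetical 12.0% baseline.

```python
# Worked example of "relative improvement in WER": a 2-3% relative gain on
# a hypothetical 12.0% baseline WER.
baseline_wer = 12.0                       # % (illustrative)
for rel in (0.02, 0.03):
    print(f"{rel:.0%} relative -> {baseline_wer * (1 - rel):.2f}% absolute WER")
```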