Search Results for author: George Saon

Found 36 papers, 1 paper with code

Exploring the limits of decoder-only models trained on public speech recognition corpora

no code implementations • 31 Jan 2024 • Ankit Gupta, George Saon, Brian Kingsbury

The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data, respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines.

Decoder, Speech Recognition +1

Soft Random Sampling: A Theoretical and Empirical Analysis

no code implementations • 21 Nov 2023 • Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury

Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data.
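As a minimal sketch of the idea, assuming the common formulation in which each epoch trains on only a random fraction of the dataset, drawn uniformly with replacement (the `rate` name and values here are illustrative, not from the paper):

```python
import numpy as np

def soft_random_sample(n_total, rate, rng):
    """Draw round(rate * n_total) example indices uniformly WITH replacement.

    Each epoch sees only a fraction of the data, but over many epochs
    every example is visited with high probability.
    """
    n_keep = int(round(rate * n_total))
    return rng.integers(0, n_total, size=n_keep)

rng = np.random.default_rng(0)
epoch_indices = soft_random_sample(n_total=1000, rate=0.5, rng=rng)
print(len(epoch_indices))                      # 500 indices per epoch
print(len(np.unique(epoch_indices)) < 500)     # sampling with replacement yields duplicates
```

Sampling with replacement (rather than a fixed random subset) is what makes the per-epoch loss an unbiased estimate of the full-data loss.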

Automatic Speech Recognition, Speech Recognition +1

Semi-Autoregressive Streaming ASR With Label Context

no code implementations • 19 Sep 2023 • Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.

Automatic Speech Recognition (ASR) +2

Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems

no code implementations • 7 Sep 2023 • Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon

However, existing works only transfer a single representation of the LLM (e.g., the last layer of a pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different layers, contexts and models.

Automatic Speech Recognition (ASR) +1

Diagonal State Space Augmented Transformers for Speech Recognition

no code implementations • 27 Feb 2023 • George Saon, Ankit Gupta, Xiaodong Cui

We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models.
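A minimal sketch of what a diagonal state space layer computes: the SSM is materialized as a long 1-D convolution kernel from a diagonal state matrix. The parameter shapes and DSS's normalization details are simplified here, and all names are illustrative:

```python
import numpy as np

def dss_kernel(w, lam, length):
    """Materialize a 1-D convolution kernel from a diagonal state space model.

    w, lam: complex parameters (output weights and the diagonal of the
    state matrix). Kernel K[l] = Re( sum_n w[n] * exp(lam[n] * l) ).
    """
    pos = np.arange(length)                      # (L,)
    basis = np.exp(lam[:, None] * pos[None, :])  # (N, L)
    return (w @ basis).real                      # (L,)

rng = np.random.default_rng(0)
n_states, L = 4, 16
w = rng.standard_normal(n_states) + 1j * rng.standard_normal(n_states)
# negative real parts make the kernel decay over time
lam = -np.abs(rng.standard_normal(n_states)) + 1j * rng.standard_normal(n_states)
k = dss_kernel(w, lam, L)
y = np.convolve(np.ones(L), k)[:L]  # causal convolution with the SSM kernel
print(k.shape, y.shape)  # (16,) (16,)
```

Unlike a depthwise convolution with a fixed short receptive field, the kernel length here is set at materialization time, so the same parameters cover arbitrarily long contexts.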

Speech Recognition

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations • 3 Aug 2022 • Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modelling

Extending RNN-T-based speech recognition systems with emotion and language classification

no code implementations • 28 Jul 2022 • Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon

We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition.

Emotion Classification, Emotion Recognition +3

Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing

no code implementations • 29 Mar 2022 • Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
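One plausible realization of length perturbation is random frame dropping plus random frame duplication; the paper's exact scheme and rates may differ, n-best based label smoothing is not shown, and all names here are made up:

```python
import numpy as np

def length_perturb(frames, drop_prob=0.1, dup_prob=0.1, rng=None):
    """Randomly drop some frames and duplicate others to vary utterance length."""
    rng = rng if rng is not None else np.random.default_rng()
    out = []
    for f in frames:
        if rng.random() < drop_prob:
            continue            # shorten: drop this frame
        out.append(f)
        if rng.random() < dup_prob:
            out.append(f)       # lengthen: repeat this frame
    return np.asarray(out)

rng = np.random.default_rng(0)
frames = np.arange(20)
perturbed = length_perturb(frames, rng=rng)
print(len(frames), len(perturbed))  # perturbed length differs from 20 in general
```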

Automatic Speech Recognition (ASR) +2

Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

no code implementations • 26 Feb 2022 • Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.

Spoken Language Understanding

Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

no code implementations • 26 Feb 2022 • Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

We observe 20-45% relative word error rate (WER) reduction in these settings with this novel LM style customization technique using only unpaired text data from the new domains.

Automatic Speech Recognition (ASR) +1

Improving End-to-End Models for Set Prediction in Spoken Language Understanding

no code implementations • 28 Jan 2022 • Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts.

Data Augmentation, Decoder +3

Asynchronous Decentralized Distributed Training of Acoustic Models

no code implementations • 21 Oct 2021 • Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.
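As an illustration of the fixed ring pattern, here is a synchronous simplification: each worker averages its parameters with its two ring neighbors. Real ADPSGD mixes asynchronously and pairwise, and all names below are illustrative:

```python
import numpy as np

def ring_mixing_step(params):
    """One averaging step on a ring: each worker mixes with its two neighbors.

    params: (n_workers, dim). The mixing matrix is doubly stochastic, so the
    global average is preserved while workers drift toward consensus.
    """
    n = len(params)
    mixed = np.empty_like(params)
    for i in range(n):
        mixed[i] = (params[(i - 1) % n] + params[i] + params[(i + 1) % n]) / 3.0
    return mixed

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 5))   # 8 workers, 5 parameters each
for _ in range(50):
    weights = ring_mixing_step(weights)
# repeated mixing drives all workers toward the global average
print(np.allclose(weights, weights.mean(axis=0), atol=1e-3))  # True
```

Randomized patterns and delay-by-one schemes change which neighbors mix at each step, trading consensus speed against communication cost.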

Automatic Speech Recognition (ASR) +1

4-bit Quantization of LSTM-based Speech Recognition Models

no code implementations • 27 Aug 2021 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM-Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network Transducers (RNN-Ts).
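A minimal sketch of symmetric per-tensor 4-bit uniform quantization, the basic operation behind such low-precision representations (the paper's scheme is likely more refined, e.g. per-channel scales and quantization-aware training):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization: 16 integer levels in [-8, 7]."""
    scale = np.abs(x).max() / 7.0  # map the largest magnitude to level 7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
print(q.min() >= -8 and q.max() <= 7)                    # True: values fit in 4 bits
print(float(np.abs(w - w_hat).max()) <= s / 2 + 1e-6)    # True: error bounded by half a step
```

The interesting engineering question, which the paper studies, is where this rounding error can be tolerated (weights vs. activations, which layers) without degrading WER.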

Automatic Speech Recognition (ASR) +2

Reducing Exposure Bias in Training Recurrent Neural Network Transducers

no code implementations • 24 Aug 2021 • Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske

By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.

Automatic Speech Recognition (ASR) +2

Integrating Dialog History into End-to-End Spoken Language Understanding Systems

no code implementations • 18 Aug 2021 • Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently.

Intent Recognition, Spoken Language Understanding

On the limit of English conversational speech recognition

no code implementations • 3 May 2021 • Zoltán Tüske, George Saon, Brian Kingsbury

Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
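The probability ratio (density ratio) style of hypothesis rescoring can be sketched as follows; the weights and log-probabilities below are made up for illustration:

```python
def density_ratio_score(logp_asr, logp_source_lm, logp_ext_lm,
                        lam_ext=0.6, lam_src=0.6, n_words=1, reward=0.0):
    """Hypothesis score under a probability-ratio (density-ratio) combination:

        score = log P_asr(y|x) - lam_src * log P_src(y)
                               + lam_ext * log P_ext(y) + reward * |y|

    Subtracting the source-domain LM score removes the model's built-in
    language bias before the external LM is added.
    """
    return (logp_asr - lam_src * logp_source_lm
            + lam_ext * logp_ext_lm + reward * n_words)

# Toy rerank of two hypotheses:
h1 = density_ratio_score(-12.0, -20.0, -15.0, n_words=5)
h2 = density_ratio_score(-12.5, -25.0, -14.0, n_words=5)
print(h2 > h1)  # h2 wins once the source-LM bias is removed
```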

Decoder, English Conversational Speech Recognition +2

Advancing RNN Transducer Technology for Speech Recognition

no code implementations • 17 Mar 2021 • George Saon, Zoltan Tueske, Daniel Bolanos, Brian Kingsbury

The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.

Language Modelling, Speech Recognition +1

Improving Efficiency in Large-Scale Decentralized Distributed Training

no code implementations • 4 Feb 2020 • Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.

Speech Recognition

Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

no code implementations • 20 Jan 2020 • Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training.

Data Augmentation, Language Modelling +2

Challenging the Boundaries of Speech Recognition: The MALACH Corpus

no code implementations • 9 Aug 2019 • Michael Picheny, Zoltán Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon

This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech.

Speech Recognition

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

no code implementations • 10 Jul 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny

On the commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enabling training at a much larger scale.

Automatic Speech Recognition (ASR) +1

English Broadcast News Speech Recognition by Humans and Machines

no code implementations • 30 Apr 2019 • Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko

With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition.

Automatic Speech Recognition (ASR) +1

Distributed Deep Learning Strategies For Automatic Speech Recognition

no code implementations • 10 Apr 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set.
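WER figures like these come from a word-level Levenshtein alignment between reference and hypothesis; a standard self-contained computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(r)

wer = word_error_rate("the cat sat on the mat", "the cat sat mat")
print(wer)  # 2 deletions / 6 reference words ≈ 0.333
```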

Automatic Speech Recognition (ASR) +1

Building competitive direct acoustics-to-word models for English conversational speech recognition

no code implementations • 8 Dec 2017 • Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple.
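Greedy CTC decoding illustrates why such models need no decoder or lexicon: take the best word label per frame, collapse repeats, and drop blanks (the example frames below are made up):

```python
import itertools

BLANK = "<blank>"

def ctc_greedy_decode(frame_labels):
    """Greedy CTC decoding: collapse repeated labels, then drop blanks.

    frame_labels: the argmax word label for each acoustic frame.
    """
    collapsed = [label for label, _ in itertools.groupby(frame_labels)]
    return [label for label in collapsed if label != BLANK]

frames = [BLANK, "the", "the", BLANK, "cat", "cat", "cat", BLANK, BLANK, "sat"]
print(ctc_greedy_decode(frames))  # ['the', 'cat', 'sat']
```

With a word-level output vocabulary, this collapse step is the entire decoding pipeline.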

Automatic Speech Recognition (ASR) +4

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

no code implementations • 17 Oct 2017 • Xiaodong Cui, Vaibhava Goel, George Saon

An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling.

Speech Recognition

Language Modeling with Highway LSTM

no code implementations • 19 Sep 2017 • Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy

Language models (LMs) based on Long Short-Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks.

Automatic Speech Recognition (ASR) +2

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

no code implementations • 22 Mar 2017 • Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM.

Automatic Speech Recognition (ASR) +4

The IBM 2016 English Conversational Telephone Speech Recognition System

no code implementations • 27 Apr 2016 • George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo

We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset.

Language Modelling, Speech Recognition +1

Improvements to deep convolutional neural networks for LVCSR

no code implementations • 5 Sep 2013 • Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.

Speech Recognition
