Search Results for author: Brian Kingsbury

Found 56 papers, 10 papers with code

Exploring the limits of decoder-only models trained on public speech recognition corpora

no code implementations31 Jan 2024 Ankit Gupta, George Saon, Brian Kingsbury

The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio only proprietary data respectively, has led to a stronger need for large scale public ASR corpora and competitive open source pipelines.

Decoder speech-recognition +1

Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization

1 code implementation13 Jan 2024 A F M Saif, Xiaodong Cui, Han Shen, Songtao Lu, Brian Kingsbury, Tianyi Chen

In this paper, we present a novel bilevel optimization-based training approach to training acoustic models for automatic speech recognition (ASR) tasks that we term {bi-level joint unsupervised and supervised training (BL-JUST)}.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Soft Random Sampling: A Theoretical and Empirical Analysis

no code implementations21 Nov 2023 Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei zhang, George Saon, Brian Kingsbury

Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data.

Automatic Speech Recognition speech-recognition +1

Semi-Autoregressive Streaming ASR With Label Context

no code implementations19 Sep 2023 Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

no code implementations21 May 2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.

High-Dimensional Smoothed Entropy Estimation via Dimensionality Reduction

no code implementations8 May 2023 Kristjan Greenewald, Brian Kingsbury, Yuancheng Yu

We study the problem of overcoming exponential sample complexity in differential entropy estimation under Gaussian convolutions.

Dimensionality Reduction Vocal Bursts Intensity Prediction

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

1 code implementation7 Oct 2022 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.

Knowledge Distillation Retrieval +2

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

no code implementations3 Aug 2022 Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.

Language Modelling

Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing

no code implementations29 Mar 2022 Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

no code implementations26 Feb 2022 Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.

Spoken Language Understanding

Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

no code implementations26 Feb 2022 Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

We observe 20-45% relative word error rate (WER) reduction in these settings with this novel LM style customization technique using only unpaired text data from the new domains.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Improving End-to-End Models for Set Prediction in Spoken Language Understanding

no code implementations28 Jan 2022 Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts.

Data Augmentation Decoder +3

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

1 code implementation CVPR 2022 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne

In this work, we present a multi-modal, modality agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joined multi-modal embedding space.

Action Localization Retrieval +2

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

1 code implementation8 Dec 2021 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification.

Action Localization Retrieval +2

Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent

no code implementations2 Dec 2021 Wei zhang, Mingrui Liu, Yu Feng, Xiaodong Cui, Brian Kingsbury, Yuhai Tu

We conduct extensive studies over 18 state-of-the-art DL models/tasks and demonstrate that DPSGD often converges in cases where SSGD diverges for large learning rates in the large batch setting.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Asynchronous Decentralized Distributed Training of Acoustic Models

no code implementations21 Oct 2021 Xiaodong Cui, Wei zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

4-bit Quantization of LSTM-based Speech Recognition Models

no code implementations27 Aug 2021 Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei zhang, Zoltán Tüske, Kailash Gopalakrishnan

We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Reducing Exposure Bias in Training Recurrent Neural Network Transducers

no code implementations24 Aug 2021 Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske

By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Integrating Dialog History into End-to-End Spoken Language Understanding Systems

no code implementations18 Aug 2021 Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently.

Intent Recognition Spoken Language Understanding

On the limit of English conversational speech recognition

no code implementations3 May 2021 Zoltán Tüske, George Saon, Brian Kingsbury

Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5. 9% and 11. 5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.

Decoder English Conversational Speech Recognition +2

Advancing RNN Transducer Technology for Speech Recognition

no code implementations17 Mar 2021 George Saon, Zoltan Tueske, Daniel Bolanos, Brian Kingsbury

The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.

Language Modelling speech-recognition +1

Federated Acoustic Modeling For Automatic Speech Recognition

no code implementations8 Feb 2021 Xiaodong Cui, Songtao Lu, Brian Kingsbury

In this paper, we investigate federated acoustic modeling using data from multiple clients.

Federated Learning Speech Recognition Sound Distributed, Parallel, and Cluster Computing Audio and Speech Processing

Why Does Decentralized Training Outperform Synchronous Training In The Large Batch Setting?

no code implementations1 Jan 2021 Wei zhang, Mingrui Liu, Yu Feng, Brian Kingsbury, Yuhai Tu

We conduct extensive studies over 12 state-of-the-art DL models/tasks and demonstrate that DPSGD consistently outperforms SSGD in the large batch setting; and DPSGD converges in cases where SSGD diverges for large learning rates.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features

no code implementations16 Nov 2020 Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury

Transformer networks and self-supervised pre-training have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation.

Spoken Language Understanding

Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

no code implementations8 Oct 2020 Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny

Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system.

Data Augmentation intent-classification +2

End-to-End Spoken Language Understanding Without Full Transcripts

no code implementations30 Sep 2020 Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed impressive ability to skip non-entity words: there was little degradation when trained on just entities versus full transcripts.

Decoder slot-filling +4

Improving Efficiency in Large-Scale Decentralized Distributed Training

no code implementations4 Feb 2020 Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) is a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.

speech-recognition Speech Recognition

Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

no code implementations20 Jan 2020 Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training.

Data Augmentation Language Modelling +2

A Kolmogorov Complexity Approach to Generalization in Deep Learning

no code implementations25 Sep 2019 Hazar Yueksel, Kush R. Varshney, Brian Kingsbury

Using this condition, we formulate an optimization problem to learn a more general classification function.

Classification General Classification +1

Challenging the Boundaries of Speech Recognition: The MALACH Corpus

no code implementations9 Aug 2019 Michael Picheny, Zóltan Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon

This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech.

speech-recognition Speech Recognition

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

no code implementations10 Jul 2019 Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny

On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enable training at a much larger scale.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

English Broadcast News Speech Recognition by Humans and Machines

no code implementations30 Apr 2019 Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko

With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Distributed Deep Learning Strategies For Automatic Speech Recognition

no code implementations10 Apr 2019 Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7. 6% WER on the Hub5- 2000 Switchboard (SWB) test set and a 13. 1% WER on the CallHome (CH) test set.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Understanding Unequal Gender Classification Accuracy from Face Images

no code implementations30 Nov 2018 Vidya Muthukumar, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, Kush R. Varshney

Recent work shows unequal performance of commercial face classification services in the gender classification task across intersectional groups defined by skin type and gender.

Classification Gender Classification +1

Estimating Information Flow in Deep Neural Networks

no code implementations12 Oct 2018 Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, Yury Polyanskiy

We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models.

Clustering

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

1 code implementation24 Jun 2018 Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Brian Kingsbury, Paolo DiAchille, Viatcheslav Gurev, Ravi Tejwani, Djallel Bouneffouf

Despite significant recent advances in deep neural networks, training them remains a challenge due to the highly non-convex nature of the objective function.

Building competitive direct acoustics-to-word models for English conversational speech recognition

no code implementations8 Dec 2017 Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Kernel Approximation Methods for Speech Recognition

no code implementations13 Jan 2017 Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, Michael Picheny, Fei Sha

First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection.

feature selection speech-recognition +1

End-to-End ASR-free Keyword Search from Speech

no code implementations13 Jan 2017 Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury

The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Very Deep Multilingual Convolutional Neural Networks for LVCSR

no code implementations29 Sep 2015 Tom Sercu, Christian Puhrsch, Brian Kingsbury, Yann Lecun

However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance.

speech-recognition Speech Recognition

Improvements to deep convolutional neural networks for LVCSR

no code implementations5 Sep 2013 Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.

Speech Recognition

Accelerating Hessian-free optimization for deep neural networks by implicit preconditioning and sampling

no code implementations5 Sep 2013 Tara N. Sainath, Lior Horesh, Brian Kingsbury, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian.

Deep Neural Networks for Acoustic Modeling in Speech Recognition

no code implementations Signal Processing Magazine 2012 Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input.

speech-recognition Speech Recognition

Cannot find the paper you are looking for? You can Submit a new open access paper.