no code implementations • 18 Oct 2023 • Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays.
no code implementations • 18 Oct 2023 • Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg
This capability offers a tailored training environment for developing neural models suited for speaker diarization and voice activity detection.
no code implementations • 14 Oct 2023 • Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning and speaker verification models.
1 code implementation • 13 Oct 2023 • Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities.
Automatic Speech Recognition (ASR)
no code implementations • 4 Oct 2023 • Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
Traditional automatic speech recognition (ASR) models output lower-cased words without punctuation marks, which reduces readability and necessitates a subsequent text processing model to convert ASR transcripts into a proper format.
Automatic Speech Recognition (ASR)
no code implementations • 23 Sep 2023 • Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg
Through this new framework, we can identify strengths and weaknesses of GPT-based TN, opening opportunities for future work.
no code implementations • 19 Sep 2023 • Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
Discrete audio representation, also known as audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain.
no code implementations • 18 Sep 2023 • Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg
This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio.
Automatic Speech Recognition (ASR)
1 code implementation • 9 Aug 2023 • Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR).
no code implementations • 13 Jul 2023 • He Huang, Jagadeesh Balam, Boris Ginsburg
We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.
no code implementations • 27 Jun 2023 • Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg
Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.
1 code implementation • 14 Jun 2023 • Kunal Dhawan, Dima Rekesh, Boris Ginsburg
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
Automatic Speech Recognition (ASR)
1 code implementation • 4 Jun 2023 • Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition (ASR) quality given user vocabulary.
Automatic Speech Recognition (ASR)
no code implementations • 8 May 2023 • Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks.
1 code implementation • 13 Apr 2023 • Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg
TDT models for Speech Recognition achieve better accuracy and up to 2.82x faster inference than conventional Transducers.
Ranked #1 on Slot Filling on SLURP
Intent Classification, Intent Classification and Slot Filling
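The speed-up in TDT comes from jointly predicting a token and a duration, so decoding can skip over encoder frames instead of advancing one frame at a time. A rough Python sketch of greedy decoding under that idea follows; `predictor` and `joint` are placeholder callables, and the handling of blanks and zero durations only approximates the paper's exact scheme.

```python
def tdt_greedy_decode(encoder_frames, predictor, joint, blank_id, max_symbols=5):
    """Greedy decoding sketch for a Token-and-Duration Transducer (TDT).

    encoder_frames: sequence of encoder output vectors, length T.
    predictor: callable(state, last_token) -> (dec_out, new_state).
    joint: callable(enc_t, dec_out) -> (token_logits, duration_logits),
           where duration_logits scores skips of 0..D-1 frames.
    """
    hyp, state, last_token = [], None, blank_id
    t = 0
    while t < len(encoder_frames):
        skip, emitted = 1, 0
        while emitted < max_symbols:
            dec_out, new_state = predictor(state, last_token)
            token_logits, duration_logits = joint(encoder_frames[t], dec_out)
            token = int(token_logits.argmax())
            skip = int(duration_logits.argmax())
            if token != blank_id:
                hyp.append(token)
                last_token, state = token, new_state
                emitted += 1
            # A positive predicted duration (or a blank) ends emission at this frame.
            if skip > 0 or token == blank_id:
                break
        t += max(skip, 1)  # advance by the predicted duration; force progress on 0
    return hyp
```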
no code implementations • 18 Mar 2023 • Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.
no code implementations • 14 Mar 2023 • Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro
We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system.
no code implementations • 27 Feb 2023 • Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.
Automatic Speech Recognition (ASR)
no code implementations • 16 Feb 2023 • Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg
In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
no code implementations • 16 Dec 2022 • Aleksandr Laptev, Boris Ginsburg
This paper presents a class of new fast non-trainable entropy-based confidence estimation methods for automatic speech recognition.
Automatic Speech Recognition (ASR)
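As a rough illustration of an entropy-based, non-trainable confidence score, the sketch below computes a normalized-entropy confidence from per-frame log-softmax outputs; the normalization and the mean aggregation are illustrative choices, not the exact estimators evaluated in the paper.

```python
import numpy as np

def frame_confidence(log_probs: np.ndarray) -> np.ndarray:
    """Per-frame confidence from the Shannon entropy of the softmax output.

    log_probs: array of shape (T, V) with log-probabilities over V tokens.
    Returns values in [0, 1]; 1 = fully confident, 0 = uniform distribution.
    """
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)     # (T,)
    max_entropy = np.log(log_probs.shape[-1])       # entropy of the uniform distribution
    return 1.0 - entropy / max_entropy

def utterance_confidence(log_probs: np.ndarray) -> float:
    # Aggregate per-frame scores; the arithmetic mean is one simple choice.
    return float(frame_confidence(log_probs).mean())
```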
no code implementations • 9 Nov 2022 • Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg
In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm.
1 code implementation • 4 Nov 2022 • Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg
This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
1 code implementation • 1 Nov 2022 • Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg
In the proposed approach, a few small adapter modules are added to the original network.
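A minimal PyTorch-style sketch of the residual bottleneck adapter idea referenced above: small trainable modules are inserted while the original network stays frozen. Layer sizes and placement are illustrative assumptions, not the exact modules used in the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection

# Usage sketch: freeze the pretrained model, train only the adapter parameters.
# for p in base_model.parameters():
#     p.requires_grad = False
# adapter = Adapter(dim=512)
# y = adapter(base_block_output)
```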
no code implementations • 27 Oct 2022 • Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture.
no code implementations • 6 Oct 2022 • Somshubra Majumdar, Shantanu Acharya, Vitaly Lavrukhin, Boris Ginsburg
Automatic speech recognition models are often adapted to improve their accuracy in a new domain.
Automatic Speech Recognition (ASR)
no code implementations • 29 Jul 2022 • Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg
The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets.
Automatic Speech Recognition (ASR)
3 code implementations • 9 Jun 2022 • Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon
Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.
Ranked #4 on Speech Synthesis on LibriTTS
no code implementations • 30 Mar 2022 • Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
First, we use multi-scale clustering as an initialization to estimate the number of speakers and obtain the average speaker representation vector for each speaker and each scale.
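A small sketch of that initialization step: cluster scale-averaged speaker embeddings to obtain initial labels, then average the embeddings per speaker at each scale. The plain k-means call and the fixed speaker count are simplifications; the actual system estimates the number of speakers during clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_speaker_profiles(multiscale_embs, n_speakers):
    """multiscale_embs: list over scales; each item is an (N_segments, D) array
    whose segments are aligned across scales (same N at every scale)."""
    # Cluster the scale-averaged embeddings to assign an initial speaker label
    # to every segment.
    fused = np.mean(np.stack(multiscale_embs, axis=0), axis=0)        # (N, D)
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(fused)

    # Average speaker representation vector per speaker and per scale.
    profiles = np.stack([
        np.stack([embs[labels == spk].mean(axis=0) for spk in range(n_speakers)])
        for embs in multiscale_embs
    ])                                                                 # (S, n_speakers, D)
    return labels, profiles
```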
1 code implementation • 29 Mar 2022 • Evelina Bakhturina, Yang Zhang, Boris Ginsburg
First, a non-deterministic WFST outputs all normalization candidates, and then a neural language model picks the best one, similar to shallow fusion for automatic speech recognition.
Automatic Speech Recognition (ASR)
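The hybrid idea can be sketched as candidate generation plus rescoring: a WFST proposes all verbalizations of a token and a neural LM picks the one that fits the sentence best. Both `normalization_candidates` and `lm_score` below are hypothetical placeholders for those two components.

```python
def choose_normalization(sentence_tokens, span_idx, normalization_candidates, lm_score):
    """Pick the WFST candidate that the neural LM scores highest in context.

    normalization_candidates: callable(token) -> list of candidate verbalizations
        produced by the non-deterministic WFST (e.g. "127" -> ["one hundred
        twenty seven", "one twenty seven", "one two seven"]).
    lm_score: callable(text) -> log-probability of the full sentence.
    """
    best_text, best_score = None, float("-inf")
    for candidate in normalization_candidates(sentence_tokens[span_idx]):
        hypothesis = sentence_tokens[:span_idx] + [candidate] + sentence_tokens[span_idx + 1:]
        text = " ".join(hypothesis)
        score = lm_score(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```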
no code implementations • 12 Oct 2021 • Paarth Neekhara, Jason Li, Boris Ginsburg
We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data.
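A PyTorch-style sketch of the kind of recipe such guidelines suggest: start from the pretrained single-speaker model, freeze most of it, and fine-tune a small set of modules at a low learning rate on the new speaker's data. Which submodules to unfreeze is an assumption made for illustration.

```python
import torch

def prepare_for_speaker_adaptation(tts_model, trainable_prefixes=("decoder.", "pitch_predictor.")):
    """Freeze the pretrained TTS model except for a few named submodules."""
    for name, param in tts_model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    trainable = [p for p in tts_model.parameters() if p.requires_grad]
    # A low learning rate helps avoid forgetting the original model's prosody.
    return torch.optim.Adam(trainable, lr=1e-5)
```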
2 code implementations • 8 Oct 2021 • Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations.
Ranked #1 on Speaker Diarization on CH109
1 code implementation • 7 Oct 2021 • Oktai Tatanov, Stanislav Beliaev, Boris Ginsburg
This paper describes Mixer-TTS, a non-autoregressive model for mel-spectrogram generation.
no code implementations • 6 Oct 2021 • Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg
This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition.
Automatic Speech Recognition (ASR)
no code implementations • 23 Aug 2021 • Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji
In addition, we create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on it.
Automatic Speech Recognition (ASR)
no code implementations • 22 Jul 2021 • Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg
The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network.
Automatic Speech Recognition (ASR)
no code implementations • 17 May 2021 • Yang Zhang, Vahid Noroozi, Evelina Bakhturina, Boris Ginsburg
In this paper, we propose SGD-QA, a simple and extensible model for schema-guided dialogue state tracking based on a question answering approach.
1 code implementation • 16 Apr 2021 • Stanislav Beliaev, Boris Ginsburg
We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction.
1 code implementation • 11 Apr 2021 • Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg
Inverse text normalization (ITN) converts spoken-domain automatic speech recognition (ASR) output into written-domain text to improve the readability of the ASR output.
Automatic Speech Recognition (ASR)
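To make the ITN direction concrete, the toy function below maps a tiny subset of spoken-domain text to written form with hard-coded rules; the actual system is built from WFST grammars, so this only illustrates the intended input/output behavior.

```python
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
}

def inverse_normalize(spoken: str) -> str:
    """Convert a tiny subset of spoken-domain text into written-domain text."""
    tokens = spoken.lower().split()
    value, seen_number, out = 0, False, []
    for tok in tokens:
        if tok in NUMBER_WORDS:
            value += NUMBER_WORDS[tok]
            seen_number = True
        elif tok == "dollars" and seen_number:
            out.append(f"${value}")
            value, seen_number = 0, False
        else:
            if seen_number:
                out.append(str(value))
                value, seen_number = 0, False
            out.append(tok)
    if seen_number:
        out.append(str(value))
    return " ".join(out)

# inverse_normalize("twenty three dollars")        -> "$23"
# inverse_normalize("i have twenty three apples")  -> "i have 23 apples"
```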
1 code implementation • 11 Apr 2021 • Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets.
Automatic Speech Recognition (ASR)
1 code implementation • 5 Apr 2021 • Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.
Ranked #2 on Speech Recognition on SPGISpeech
no code implementations • 5 Apr 2021 • Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model.
Automatic Speech Recognition (ASR)
no code implementations • 3 Apr 2021 • Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang
This paper introduces a new multi-speaker English dataset for training text-to-speech models.
no code implementations • 18 Jul 2020 • Boris Ginsburg
We analyze the training dynamics for deep linear networks using a new metric - layer imbalance - which defines the flatness of a solution.
no code implementations • ICLR 2020 • Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay.
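A numpy sketch of the NovoGrad update as described (a scalar second moment per layer, layer-wise gradient normalization, and decoupled weight decay); the default hyperparameters and the first-step initialization are approximations of the published algorithm.

```python
import numpy as np

def novograd_step(params, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.001, eps=1e-8):
    """One NovoGrad update over a list of per-layer parameter arrays.

    state: dict holding per-layer first moments 'm' and scalar second moments 'v'.
    """
    if not state:
        state["m"] = [np.zeros_like(p) for p in params]
        state["v"] = [None] * len(params)

    for i, (p, g) in enumerate(zip(params, grads)):
        g_norm_sq = float(np.sum(g * g))
        if state["v"][i] is None:                      # first step: seed with the gradient norm
            state["v"][i] = g_norm_sq
        else:
            state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g_norm_sq

        # Layer-wise normalized gradient plus decoupled weight decay.
        g_hat = g / (np.sqrt(state["v"][i]) + eps) + weight_decay * p
        state["m"][i] = beta1 * state["m"][i] + g_hat
        p -= lr * state["m"][i]                        # in-place SGD-style update
    return params, state
```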
no code implementations • 23 Oct 2019 • Oleksii Hrinchuk, Mariya Popova, Boris Ginsburg
In this work, we introduce a simple yet efficient post-processing model for automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
15 code implementations • 22 Oct 2019 • Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang
We propose a new end-to-end neural acoustic model for automatic speech recognition.
Ranked #30 on Speech Recognition on LibriSpeech test-clean
Speech Recognition, Audio and Speech Processing
1 code implementation • 14 Sep 2019 • Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, Jonathan M. Cohen
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition.
Ranked #1 on Speech Recognition on Common Voice Spanish (using extra training data)
Automatic Speech Recognition (ASR)
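A minimal usage sketch of the workflow NeMo targets, assuming the `EncDecCTCModel.from_pretrained` API and the `QuartzNet15x5Base-En` checkpoint name; consult the current NeMo documentation for exact package extras, model names, and signatures.

```python
# pip install "nemo_toolkit[asr]"   (assumed package extra; see the NeMo docs)
import nemo.collections.asr as nemo_asr

# Load a pretrained CTC ASR model from the model registry.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe a local WAV file (the path is illustrative).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```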
3 code implementations • 27 May 2019 • Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay.
10 code implementations • 5 Apr 2019 • Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data.
Ranked #3 on Speech Recognition on Hub5'00 SwitchBoard
no code implementations • 2 Nov 2018 • Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers.
Automatic Speech Recognition (ASR)
no code implementations • WS 2018 • Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, Paulius Micikevicius
We present OpenSeq2Seq {--} an open-source toolkit for training sequence-to-sequence models.
Automatic Speech Recognition (ASR)
3 code implementations • 25 May 2018 • Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Carl Case, Paulius Micikevicius
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training.
Automatic Speech Recognition (ASR)
no code implementations • ICLR 2018 • Boris Ginsburg, Igor Gitman, Yang You
Using LARS, we scaled AlexNet and ResNet-50 to a batch size of 16K.
no code implementations • ICLR 2018 • Oleksii Kuchaiev, Boris Ginsburg
Our model is based on a deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.
8 code implementations • ICLR 2018 • Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x.
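A numpy sketch of the main ingredients in this line of work: FP32 master weights, an FP16 copy used for the forward/backward pass, and loss scaling so that small gradients survive FP16's dynamic range. The `grad_fn` callable stands in for the model's scaled backward pass.

```python
import numpy as np

def mixed_precision_step(master_weights, grad_fn, batch, lr=0.01, loss_scale=1024.0):
    """One training step with FP32 master weights and a scaled FP16 backward pass.

    grad_fn: callable(fp16_weights, batch, loss_scale) -> FP16 gradients of the
             *scaled* loss (placeholder for the model's forward/backward).
    """
    fp16_weights = [w.astype(np.float16) for w in master_weights]   # half-precision copy
    scaled_grads = grad_fn(fp16_weights, batch, loss_scale)

    for w, g in zip(master_weights, scaled_grads):
        g32 = g.astype(np.float32) / loss_scale        # unscale in full precision
        if not np.all(np.isfinite(g32)):               # skip the step on overflow
            return master_weights
        w -= lr * g32                                  # update the FP32 master copy
    return master_weights
```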
no code implementations • 24 Sep 2017 • Igor Gitman, Boris Ginsburg
However, it is not clear if these algorithms could replace BN in practical, large-scale applications.
12 code implementations • 13 Aug 2017 • Yang You, Igor Gitman, Boris Ginsburg
Using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K without loss in accuracy.
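A numpy sketch of the layer-wise adaptive rate scaling rule: each layer's update is scaled by the ratio of its weight norm to its gradient norm. Momentum is omitted, and the trust coefficient and weight decay are illustrative defaults.

```python
import numpy as np

def lars_step(params, grads, base_lr=1.0, trust_coeff=0.001, weight_decay=5e-4, eps=1e-9):
    """One LARS update: the per-layer learning rate is proportional to
    ||w|| / (||g|| + weight_decay * ||w||)."""
    for p, g in zip(params, grads):
        w_norm = np.linalg.norm(p)
        g_norm = np.linalg.norm(g)
        local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
        p -= base_lr * local_lr * (g + weight_decay * p)   # in-place update
    return params
```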
10 code implementations • 5 Aug 2017 • Oleksii Kuchaiev, Boris Ginsburg
Our model is based on a deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.
2 code implementations • 31 Mar 2017 • Oleksii Kuchaiev, Boris Ginsburg
We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is "matrix factorization by design" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups.
Ranked #20 on Language Modelling on One Billion Word
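The "matrix factorization by design" idea can be illustrated with a single factorized linear layer: the large gate matrix is replaced by a product of two smaller matrices. The sketch below shows only the parameter-count argument, not a complete LSTM cell; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """y = W2 (W1 x): a rank-r factorization of a large weight matrix.

    For an LSTM cell, the full matrix maps the concatenated (input + hidden)
    vector of size n + m to the four gates of size 4n, i.e. 4n * (n + m)
    parameters; the factorized pair needs only r * (n + m) + 4n * r.
    """
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.w1 = nn.Linear(in_features, rank, bias=False)   # r x (n + m)
        self.w2 = nn.Linear(rank, out_features, bias=True)   # 4n x r

    def forward(self, x):
        return self.w2(self.w1(x))

# Example parameter counts for n = m = 1024 and rank r = 256:
#   full matrix:      4 * 1024 * 2048          ~ 8.4M parameters
#   factorized pair:  256 * 2048 + 4096 * 256  ~ 1.6M parameters
```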
1 code implementation • NeurIPS 2016 • Elad Richardson, Rom Herskovitz, Boris Ginsburg, Michael Zibulevsky
SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions.