Search Results for author: Boris Ginsburg

Found 61 papers, 26 papers with code

Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

1 code implementation27 Dec 2023 Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

We also showed that training a model with multiple latencies can achieve better accuracy than single latency models while it enables us to support multiple latencies with a single model.

Automatic Speech Recognition speech-recognition +1

The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

no code implementations18 Oct 2023 Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays.

Automatic Speech Recognition speaker-diarization +3

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

no code implementations14 Oct 2023 Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning and speaker verification models.

Self-Supervised Learning Speaker Verification +2

LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models

no code implementations4 Oct 2023 Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

Traditional automatic speech recognition (ASR) models output lower-cased words without punctuation marks, which reduces readability and necessitates a subsequent text processing model to convert ASR transcripts into a proper format.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A Chat About Boring Problems: Studying GPT-based text normalization

no code implementations23 Sep 2023 Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg

Through this new framework, we can identify strengths and weaknesses of GPT-based TN, opening opportunities for future work.

Prompt Engineering

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

no code implementations19 Sep 2023 Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain.

Language Modelling Quantization +4

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

2 code implementations9 Aug 2023 Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg

We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR).

Automatic Speech Recognition speech-recognition +1

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

no code implementations13 Jul 2023 He Huang, Jagadeesh Balam, Boris Ginsburg

We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90. 14% intent accuracy and 82. 27% SLURP-F1.

intent-classification Intent Classification +7

Confidence-based Ensembles of End-to-End Speech Recognition Models

no code implementations27 Jun 2023 Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg

Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.

Language Identification Model Selection +2

Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

1 code implementation14 Jun 2023 Kunal Dhawan, Dima Rekesh, Boris Ginsburg

Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Powerful and Extensible WFST Framework for RNN-Transducer Losses

no code implementations18 Mar 2023 Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg

This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.

Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models

no code implementations9 Nov 2022 Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg

In this paper, we extend previous self-supervised approaches for language identification by experimenting with Conformer based architecture in a multilingual pre-training paradigm.

Language Identification Spoken language identification

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

1 code implementation1 Nov 2022 Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg

In the proposed approach, a few small adapter modules are added to the original network.

Speech Synthesis

A Compact End-to-End Model with Local and Global Context for Spoken Language Identification

no code implementations27 Oct 2022 Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture.

Language Identification Spoken language identification

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

no code implementations29 Jul 2022 Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg

The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

3 code implementations9 Jun 2022 Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.

Audio Generation Audio Synthesis +4

Multi-scale Speaker Diarization with Dynamic Scale Weighting

no code implementations30 Mar 2022 Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

First, we use multi-scale clustering as an initialization to estimate the number of speakers and obtain the average speaker representation vector for each speaker and each scale.

speaker-diarization Speaker Diarization

Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization

1 code implementation29 Mar 2022 Evelina Bakhturina, Yang Zhang, Boris Ginsburg

First, a non-deterministic WFST outputs all normalization candidates, and then a neural language model picks the best one -- similar to shallow fusion for automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Adapting TTS models For New Speakers using Transfer Learning

no code implementations12 Oct 2021 Paarth Neekhara, Jason Li, Boris Ginsburg

We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data.

Transfer Learning Voice Cloning

CTC Variations Through New WFST Topologies

no code implementations6 Oct 2021 Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg

This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A Unified Transformer-based Framework for Duplex Text Normalization

no code implementations23 Aug 2021 Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji

In addition, we also create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on the dataset.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

SGD-QA: Fast Schema-Guided Dialogue State Tracking for Unseen Services

no code implementations17 May 2021 Yang Zhang, Vahid Noroozi, Evelina Bakhturina, Boris Ginsburg

In this paper, we propose SGD-QA, a simple and extensible model for schema-guided dialogue state tracking based on a question answering approach.

Dialogue State Tracking Goal-Oriented Dialogue Systems +1

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

1 code implementation16 Apr 2021 Stanislav Beliaev, Boris Ginsburg

We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction.

Speech Synthesis

A Toolbox for Construction and Analysis of Speech Datasets

1 code implementation11 Apr 2021 Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

NeMo Inverse Text Normalization: From Development To Production

1 code implementation11 Apr 2021 Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

Inverse text normalization (ITN) converts spoken-domain automatic speech recognition (ASR) output into written-domain text to improve the readability of the ASR output.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

1 code implementation5 Apr 2021 Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.

speech-recognition Speech Recognition

Hi-Fi Multi-Speaker English TTS Dataset

no code implementations3 Apr 2021 Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang

This paper introduces a new multi-speaker English dataset for training text-to-speech models.

On regularization of gradient descent, layer imbalance and flat minima

no code implementations18 Jul 2020 Boris Ginsburg

We analyze the training dynamics for deep linear networks using a new metric - layer imbalance - which defines the flatness of a solution.

Data Augmentation

Jasper: An End-to-End Convolutional Neural Acoustic Model

10 code implementations5 Apr 2019 Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data.

Language Modelling Speech Recognition

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

no code implementations2 Nov 2018 Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin

Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Training Deep AutoEncoders for Recommender Systems

no code implementations ICLR 2018 Oleksii Kuchaiev, Boris Ginsburg

Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.

Recommendation Systems

Large Batch Training of Convolutional Networks

12 code implementations13 Aug 2017 Yang You, Igor Gitman, Boris Ginsburg

Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.

Training Deep AutoEncoders for Collaborative Filtering

10 code implementations5 Aug 2017 Oleksii Kuchaiev, Boris Ginsburg

Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.

Collaborative Filtering Recommendation Systems

Factorization tricks for LSTM networks

2 code implementations31 Mar 2017 Oleksii Kuchaiev, Boris Ginsburg

We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is "matrix factorization by design" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups.

Language Modelling

Cannot find the paper you are looking for? You can Submit a new open access paper.