Search Results for author: Boris Ginsburg

Found 95 papers, 34 papers with code

Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning

no code implementations19 Feb 2025 Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness.

Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

no code implementations8 Jan 2025 Alexan Ayrapetyan, Sofia Kostandian, Ara Yeroyan, Mher Yerznkanyan, Nikolay Karpov, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg

This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, and advanced data preprocessing, as well as various permissive data sources such as audiobooks, Common Voice, and YouTube.

Speech Recognition

Star Attention: Efficient LLM Inference over Long Sequences

1 code implementation26 Nov 2024 Shantanu Acharya, Fei Jia, Boris Ginsburg

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism.
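The quadratic cost mentioned above comes from materializing an n-by-n attention score matrix. Below is a minimal NumPy sketch of that vanilla baseline; it illustrates the bottleneck Star Attention targets, not the Star Attention algorithm itself, and all shapes are illustrative.

```python
import numpy as np

def vanilla_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention; the (n, n) score matrix is what makes
    long-sequence inference quadratic in time and memory."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n) <- quadratic in n
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # row-wise softmax
    return weights @ v                                 # (n, d)

n, d = 1024, 64                                        # doubling n quadruples `scores`
x = np.random.randn(n, d)
w_q, w_k, w_v = (np.random.randn(d, d) for _ in range(3))
out = vanilla_self_attention(x, w_q, w_k, w_v)
```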

Computational Efficiency

Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

no code implementations3 Oct 2024 Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg

We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model.

Speech Recognition

nGPT: Normalized Transformer with Representation Learning on the Hypersphere

no code implementations1 Oct 2024 Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg

We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere.
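A minimal sketch of the constraint named in the abstract, keeping representations on the unit hypersphere via L2 normalization; this illustrates the idea only and is not a reimplementation of nGPT. The shapes below are hypothetical.

```python
import numpy as np

def to_hypersphere(h, eps=1e-8):
    """Project hidden-state vectors onto the unit hypersphere
    (L2-normalize along the feature dimension)."""
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / (norm + eps)

h = np.random.randn(4, 16)               # (tokens, features), illustrative shapes
h_sphere = to_hypersphere(h)
assert np.allclose(np.linalg.norm(h_sphere, axis=-1), 1.0)
```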

Representation Learning

EMMeTT: Efficient Multimodal Machine Translation Training

no code implementations20 Sep 2024 Piotr Żelasko, Zhehuai Chen, Mengru Wang, Daniel Galvez, Oleksii Hrinchuk, Shuoyang Ding, Ke Hu, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg

This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST).

Automatic Speech Translation Decoder +3

Chain-of-Thought Prompting for Speech Translation

no code implementations17 Sep 2024 Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST).

Automatic Speech Recognition (ASR) +4

Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

1 code implementation10 Sep 2024 Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL.

Speaker Diarization

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

no code implementations29 Jul 2024 Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg

Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets poses challenges, particularly in expert-dependent tasks like coding, which can be cost-prohibitive.

Code Generation

Romanization Encoding For Multilingual ASR

no code implementations5 Jul 2024 Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

Automatic Speech Recognition (ASR) +3

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

no code implementations3 Jul 2024 Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models.

Automatic Speech Recognition (ASR) +3

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

no code implementations28 Jun 2024 Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities.

Decoder Language Modeling +1

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

no code implementations27 Jun 2024 Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee

Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities of large language models (LLMs).

Descriptive Instruction Following

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

no code implementations25 Jun 2024 Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.

Decoder Language Modeling +4

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

no code implementations18 Jun 2024 Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs.

Synthetic Data Generation Text to Speech

Nemotron-4 340B Technical Report

1 code implementation17 Jun 2024 Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick Legresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward.

Synthetic Data Generation

Label-Looping: Highly Efficient Decoding for Transducers

1 code implementation10 Jun 2024 Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models.
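For context, a minimal sketch of conventional frame-by-frame greedy transducer decoding is shown below; the paper's label-looping algorithm reorganizes these nested frame/label loops so that batched GPU decoding becomes efficient. The `predictor` and `joint` callables and the `max_symbols` cap are hypothetical placeholders, not NeMo APIs.

```python
def greedy_transducer_decode(encoder_frames, predictor, joint, blank_id, max_symbols=10):
    """Conventional frame-by-frame greedy decoding for a transducer (baseline sketch)."""
    hyp, state = [], None
    for enc_t in encoder_frames:            # outer loop over acoustic frames
        emitted = 0
        while emitted < max_symbols:        # inner loop over emitted labels
            pred_out, new_state = predictor(hyp, state)
            label = int(joint(enc_t, pred_out).argmax())
            if label == blank_id:           # blank: move on to the next frame
                break
            hyp.append(label)               # non-blank: emit and stay on this frame
            state = new_state
            emitted += 1
    return hyp
```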

Speech Recognition

Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis

no code implementations7 Jun 2024 Ryan Langman, Ante Jukić, Kunal Dhawan, Nithin Rao Koluguri, Boris Ginsburg

Recently, discrete audio tokens produced by neural audio codecs have become a popular alternate speech representation for speech synthesis tasks such as text-to-speech (TTS).

Speech Synthesis Text to Speech

Flexible Multichannel Speech Enhancement for Noise-Robust Frontend

no code implementations6 Jun 2024 Ante Jukić, Jagadeesh Balam, Boris Ginsburg

This paper proposes a flexible multichannel speech enhancement system with the main goal of improving robustness of automatic speech recognition (ASR) in noisy conditions.

Automatic Speech Recognition (ASR) +2

RULER: What's the Real Context Size of Your Long-Context Language Models?

6 code implementations9 Apr 2024 Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg

Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases.

Long-Context Understanding

Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

1 code implementation27 Dec 2023 Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

We also show that training a model with multiple latencies can achieve better accuracy than single-latency models, while enabling a single model to support multiple latencies.

Automatic Speech Recognition Decoder +2

The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

no code implementations18 Oct 2023 Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays.

Automatic Speech Recognition Speaker Diarization +3

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

no code implementations14 Oct 2023 Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models.

Self-Supervised Learning Speaker Verification +2

LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models

no code implementations4 Oct 2023 Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

Traditional automatic speech recognition (ASR) models output lower-cased words without punctuation marks, which reduces readability and necessitates a subsequent text processing model to convert ASR transcripts into a proper format.

Automatic Speech Recognition (ASR) +1

A Chat About Boring Problems: Studying GPT-based text normalization

no code implementations23 Sep 2023 Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg

Through this new framework, we can identify strengths and weaknesses of GPT-based TN, opening opportunities for future work.

Prompt Engineering

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

no code implementations19 Sep 2023 Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

Discrete audio representation, also known as audio tokenization, has seen renewed interest, driven by its potential to facilitate the application of text language modeling approaches in the audio domain.

Language Modeling +5

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

2 code implementations9 Aug 2023 Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg

We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR).

Automatic Speech Recognition Speech Recognition +1

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

no code implementations13 Jul 2023 He Huang, Jagadeesh Balam, Boris Ginsburg

We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.
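A minimal PyTorch-style sketch of the general setup described above: reuse an ASR-pretrained encoder and attach task heads for end-to-end SICSF. The simple linear heads, mean pooling, and dimensions are illustrative assumptions; the paper itself uses a Conformer-Transformer model rather than this simplified formulation.

```python
import torch.nn as nn

class SICSFModel(nn.Module):
    """ASR-pretrained encoder plus intent classification and slot filling heads (sketch)."""
    def __init__(self, encoder: nn.Module, enc_dim: int, n_intents: int, n_slots: int):
        super().__init__()
        self.encoder = encoder                       # weights initialized from an ASR checkpoint
        self.intent_head = nn.Linear(enc_dim, n_intents)
        self.slot_head = nn.Linear(enc_dim, n_slots)

    def forward(self, audio_features):               # (batch, time, feat)
        enc = self.encoder(audio_features)            # (batch, time, enc_dim)
        intent_logits = self.intent_head(enc.mean(dim=1))   # utterance-level prediction
        slot_logits = self.slot_head(enc)                     # frame-level prediction
        return intent_logits, slot_logits
```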

Intent Classification +7

Confidence-based Ensembles of End-to-End Speech Recognition Models

no code implementations27 Jun 2023 Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg

Second, we demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.

Language Identification Model Selection +2

Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

1 code implementation14 Jun 2023 Kunal Dhawan, Dima Rekesh, Boris Ginsburg

Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.

Automatic Speech Recognition (ASR) +3

Powerful and Extensible WFST Framework for RNN-Transducer Losses

1 code implementation18 Mar 2023 Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg

This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

1 code implementation27 Feb 2023 Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.

Automatic Speech Recognition (ASR) +2

Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models

no code implementations9 Nov 2022 Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg

In this paper, we extend previous self-supervised approaches for language identification by experimenting with Conformer based architecture in a multilingual pre-training paradigm.

Language Identification Spoken language identification

A Compact End-to-End Model with Local and Global Context for Spoken Language Identification

no code implementations27 Oct 2022 Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture.

Language Identification Spoken language identification

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

no code implementations29 Jul 2022 Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg

The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets.

Automatic Speech Recognition (ASR) +4

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

5 code implementations9 Jun 2022 Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.

Ranked #2 on Speech Synthesis on LibriTTS (using extra training data)

Audio Generation Audio Synthesis +4

Multi-scale Speaker Diarization with Dynamic Scale Weighting

no code implementations30 Mar 2022 Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

First, we use multi-scale clustering as an initialization to estimate the number of speakers and obtain the average speaker representation vector for each speaker and each scale.
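A small NumPy sketch of that initialization step: after multi-scale clustering, average the segment embeddings per speaker at each scale. The shared segment/label alignment across scales is an assumption made for illustration, not the paper's exact procedure.

```python
import numpy as np

def average_speaker_vectors(embeddings_per_scale, labels):
    """embeddings_per_scale: list of (n_segments, dim) arrays, one per scale.
    labels: (n_segments,) cluster assignments from multi-scale clustering.
    Returns {scale_index: {speaker_id: mean embedding}}."""
    averages = {}
    for s, emb in enumerate(embeddings_per_scale):
        averages[s] = {
            int(spk): emb[labels == spk].mean(axis=0)
            for spk in np.unique(labels)
        }
    return averages

# the number of speakers is estimated as len(np.unique(labels))
```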

Decoder Speaker Diarization +1

Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization

2 code implementations29 Mar 2022 Evelina Bakhturina, Yang Zhang, Boris Ginsburg

First, a non-deterministic WFST outputs all normalization candidates, and then a neural language model picks the best one -- similar to shallow fusion for automatic speech recognition.
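A minimal sketch of the two-stage pipeline described above; `generate_candidates` stands in for the non-deterministic WFST and `lm_score` for the neural language model, both hypothetical placeholders.

```python
def normalize(text, generate_candidates, lm_score):
    """Pick the normalization candidate with the highest LM score.

    generate_candidates: callable returning all normalization candidates
        for `text` (stand-in for the non-deterministic WFST).
    lm_score: callable returning a language-model log-probability for a
        candidate string (stand-in for the neural LM used in shallow fusion).
    """
    candidates = generate_candidates(text)
    return max(candidates, key=lm_score)

# e.g. candidates for "627" might include "six two seven",
# "six hundred twenty seven", and "six twenty seven"; the LM
# picks the most fluent one given the surrounding context.
```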

Automatic Speech Recognition (ASR) +3

Adapting TTS models For New Speakers using Transfer Learning

no code implementations12 Oct 2021 Paarth Neekhara, Jason Li, Boris Ginsburg

We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data.

Text to Speech Transfer Learning +1

CTC Variations Through New WFST Topologies

no code implementations6 Oct 2021 Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg

This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition.

Automatic Speech Recognition (ASR) +1

A Unified Transformer-based Framework for Duplex Text Normalization

no code implementations23 Aug 2021 Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji

In addition, we also create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on the dataset.

Automatic Speech Recognition (ASR) +6

SGD-QA: Fast Schema-Guided Dialogue State Tracking for Unseen Services

no code implementations17 May 2021 Yang Zhang, Vahid Noroozi, Evelina Bakhturina, Boris Ginsburg

In this paper, we propose SGD-QA, a simple and extensible model for schema-guided dialogue state tracking based on a question answering approach.

Dialogue State Tracking Goal-Oriented Dialogue Systems +1

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

1 code implementation16 Apr 2021 Stanislav Beliaev, Boris Ginsburg

We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction.

Speech Synthesis Text to Speech

NeMo Inverse Text Normalization: From Development To Production

1 code implementation11 Apr 2021 Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

Inverse text normalization (ITN) converts spoken-domain automatic speech recognition (ASR) output into written-domain text to improve the readability of the ASR output.
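To make the spoken-to-written direction concrete, here is a toy illustration of what ITN does to ASR output; the lookup rules below are hypothetical examples only, whereas NeMo ITN uses WFST grammars rather than a dictionary.

```python
# toy illustration only: real ITN in NeMo is grammar-based, not a dict lookup
TOY_ITN_RULES = {
    "twenty three dollars": "$23",
    "january fifth twenty twenty one": "january 5 2021",
    "three point one four": "3.14",
}

def toy_itn(spoken: str) -> str:
    """Map spoken-domain ASR output to written-domain text."""
    return TOY_ITN_RULES.get(spoken, spoken)

print(toy_itn("twenty three dollars"))   # -> $23
```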

Automatic Speech Recognition (ASR) +1

A Toolbox for Construction and Analysis of Speech Datasets

1 code implementation11 Apr 2021 Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets.

Automatic Speech Recognition (ASR) +2

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

1 code implementation5 Apr 2021 Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.

Speech Recognition

Hi-Fi Multi-Speaker English TTS Dataset

no code implementations3 Apr 2021 Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang

This paper introduces a new multi-speaker English dataset for training text-to-speech models.

Text to Speech

On regularization of gradient descent, layer imbalance and flat minima

no code implementations18 Jul 2020 Boris Ginsburg

We analyze the training dynamics for deep linear networks using a new metric - layer imbalance - which defines the flatness of a solution.

Data Augmentation

Jasper: An End-to-End Convolutional Neural Acoustic Model

10 code implementations5 Apr 2019 Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data.

Decoder Language Modeling +2

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

no code implementations2 Nov 2018 Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin

Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers.

Automatic Speech Recognition (ASR) +3

Training Deep AutoEncoders for Recommender Systems

no code implementations ICLR 2018 Oleksii Kuchaiev, Boris Ginsburg

Our model is based on a deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.

Recommendation Systems

Large Batch Training of Convolutional Networks

12 code implementations13 Aug 2017 Yang You, Igor Gitman, Boris Ginsburg

Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.
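LARS (Layer-wise Adaptive Rate Scaling), introduced in this paper, rescales the learning rate per layer by a trust ratio of the weight norm to the gradient norm. A minimal NumPy sketch of one such update follows; momentum is omitted and the constants are illustrative assumptions.

```python
import numpy as np

def lars_step(w, g, lr=0.1, weight_decay=1e-4, trust_coef=0.001):
    """One LARS-style update for a single layer (illustrative sketch).

    The global learning rate is rescaled per layer by the ratio of the
    weight norm to the norm of (gradient + weight decay term), so layers
    with small weights do not take disproportionately large steps."""
    w_norm = np.linalg.norm(w)
    update = g + weight_decay * w
    local_lr = trust_coef * w_norm / (np.linalg.norm(update) + 1e-12) if w_norm > 0 else 1.0
    return w - lr * local_lr * update
```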

Training Deep AutoEncoders for Collaborative Filtering

10 code implementations5 Aug 2017 Oleksii Kuchaiev, Boris Ginsburg

Our model is based on a deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training.
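A minimal PyTorch-style sketch of a 6-layer autoencoder over a user's sparse rating vector, matching the end-to-end setup described above; the layer widths and activation choice are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class RatingsAutoEncoder(nn.Module):
    """6-layer (3 encoder + 3 decoder) autoencoder over a user's rating vector."""
    def __init__(self, n_items, hidden=512, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_items, hidden), nn.SELU(),
            nn.Linear(hidden, hidden), nn.SELU(),
            nn.Linear(hidden, bottleneck), nn.SELU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.SELU(),
            nn.Linear(hidden, hidden), nn.SELU(),
            nn.Linear(hidden, n_items),          # reconstructed ratings for all items
        )

    def forward(self, ratings):
        return self.decoder(self.encoder(ratings))
```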

Collaborative Filtering Recommendation Systems

Factorization tricks for LSTM networks

2 code implementations31 Mar 2017 Oleksii Kuchaiev, Boris Ginsburg

We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first is "matrix factorization by design" of the LSTM matrix into the product of two smaller matrices, and the second is partitioning of the LSTM matrix, its inputs, and its states into independent groups.
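A minimal NumPy sketch of the first trick, factorizing a large weight matrix into the product of two smaller ones; the sizes and rank below are illustrative assumptions.

```python
import numpy as np

n_in, n_out, r = 1024, 4096, 128       # illustrative sizes, r << n_in, n_out

# full matrix: n_out * n_in parameters
W_full = np.random.randn(n_out, n_in)

# factorized "by design": W is replaced by W2 @ W1 with far fewer parameters
W1 = np.random.randn(r, n_in)          # r * n_in parameters
W2 = np.random.randn(n_out, r)         # n_out * r parameters

x = np.random.randn(n_in)
y_factored = W2 @ (W1 @ x)             # same output shape as W_full @ x

full_params = n_out * n_in
factored_params = r * (n_in + n_out)
print(full_params, factored_params)    # 4194304 vs 655360
```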

Language Modelling
