Search Results for author: Jagadeesh Balam

Found 33 papers, 5 papers with code

Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

no code implementations · 30 Sep 2024 · Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee

Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models.

Instruction Following, Language Modeling +1

EMMeTT: Efficient Multimodal Machine Translation Training

no code implementations · 20 Sep 2024 · Piotr Żelasko, Zhehuai Chen, Mengru Wang, Daniel Galvez, Oleksii Hrinchuk, Shuoyang Ding, Ke Hu, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg

This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime for Speech-LLMs that includes automatic speech translation (AST).

Automatic Speech Translation, Decoder +3

Chain-of-Thought Prompting for Speech Translation

no code implementations · 17 Sep 2024 · Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST).

Automatic Speech Recognition (ASR) +4
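
As a rough illustration of the chain-of-thought idea in this entry's title, the hedged sketch below prompts a Speech-LLM to transcribe first and then translate conditioned on its own transcript. The `speech_llm` callable and the prompt wording are placeholders, not the paper's actual interface.

```python
# Hypothetical two-step (chain-of-thought-style) prompting for speech
# translation: ASR first, then AST conditioned on the transcript.
def translate_with_cot(speech_llm, audio, target_lang="German"):
    transcript = speech_llm(audio, prompt="Transcribe the speech.")
    return speech_llm(
        audio,
        prompt=f"Transcript: {transcript}\n"
               f"Translate the transcript into {target_lang}.",
    )

# Toy stand-in model so the sketch runs end to end.
dummy_llm = lambda audio, prompt: f"<response to: {prompt.splitlines()[0]}>"
print(translate_with_cot(dummy_llm, audio=None))
```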

Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

1 code implementation · 10 Sep 2024 · Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL.

Speaker Diarization
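
A minimal sketch (not the authors' implementation) contrasting the two losses mentioned above: Sort Loss fixes the output speaker order by arrival time, while permutation-invariant loss (PIL) searches over all speaker permutations. Shapes and data here are toy examples.

```python
import itertools
import torch
import torch.nn.functional as F

def sort_loss(preds, targets):
    # Sort target speakers by arrival time (first active frame), so the
    # model learns a fixed first-come-first-served output order.
    first_active = torch.argmax((targets > 0).int(), dim=1)
    order = torch.argsort(first_active)
    return F.binary_cross_entropy(preds, targets[order].float())

def pil_loss(preds, targets):
    # Classic PIL: take the minimum loss over all speaker permutations.
    losses = [
        F.binary_cross_entropy(preds, targets[list(p)].float())
        for p in itertools.permutations(range(targets.shape[0]))
    ]
    return torch.stack(losses).min()

# Two speakers, six frames: speaker 2 speaks first in the target.
targets = torch.tensor([[0, 0, 0, 1, 1, 0],
                        [1, 1, 0, 0, 0, 0]])
preds = torch.rand(2, 6)  # stand-in sigmoid outputs
print(sort_loss(preds, targets), pil_loss(preds, targets))
```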

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

no code implementations · 29 Jul 2024 · Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg

Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets can be cost-prohibitive, particularly in expert-dependent tasks like coding.

Code Generation

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

no code implementations · 3 Jul 2024 · Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models.

Automatic Speech Recognition (ASR) +3

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

no code implementations · 28 Jun 2024 · Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities.

Decoder, Language Modeling +1

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

no code implementations · 18 Jun 2024 · Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs.

Synthetic Data Generation, Text to Speech

Flexible Multichannel Speech Enhancement for Noise-Robust Frontend

no code implementations · 6 Jun 2024 · Ante Jukić, Jagadeesh Balam, Boris Ginsburg

This paper proposes a flexible multichannel speech enhancement system with the main goal of improving the robustness of automatic speech recognition (ASR) in noisy conditions.

Automatic Speech Recognition (ASR) +2

Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

1 code implementation · 27 Dec 2023 · Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

We also show that training a model with multiple latencies can achieve better accuracy than single-latency models, while enabling a single model to support multiple latencies.

Automatic Speech Recognition, Decoder +2
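
To illustrate the cache-based inference idea in toy form (the paper's Conformer cache mechanism is more involved than this), the sketch below processes audio chunk by chunk while carrying a fixed-size left-context cache of past frames, so each step prepends cached context instead of recomputing the whole utterance.

```python
import torch

def stream(model, chunks, cache_len=16):
    # Rolling left-context cache of past input frames.
    cache = torch.zeros(1, cache_len, chunks[0].shape[-1])
    outputs = []
    for chunk in chunks:
        x = torch.cat([cache, chunk], dim=1)  # prepend cached context
        y = model(x)[:, cache_len:]           # emit only the new frames
        outputs.append(y)
        cache = x[:, -cache_len:]             # roll the cache forward
    return torch.cat(outputs, dim=1)

model = torch.nn.Identity()                        # stand-in acoustic model
audio = [torch.randn(1, 8, 80) for _ in range(4)]  # four 8-frame chunks
print(stream(model, audio).shape)                  # torch.Size([1, 32, 80])
```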

The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

no code implementations · 18 Oct 2023 · Tae Jin Park, He Huang, Ante Jukić, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

We present the NVIDIA NeMo team's system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) task: a multi-channel, multi-speaker speech recognition system tailored to transcribing speech from distributed microphones and microphone arrays.

Automatic Speech Recognition, Speaker Diarization +3

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

no code implementations · 19 Sep 2023 · Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

Discrete audio representation, also known as audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain.

Language Modeling +5
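
As a toy illustration of audio tokenization (not any particular codec from the paper), the sketch below quantizes frame embeddings to their nearest entries in a random codebook, producing the discrete tokens that text-style language modeling can then operate on. Real systems learn the codebook, e.g., via vector-quantization training.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 codes, 64-dim (random stand-in)
frames = rng.normal(size=(100, 64))     # 100 audio frame embeddings

# Nearest-neighbor assignment: tokens are codebook indices.
d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = d.argmin(axis=1)               # shape (100,), ints in [0, 256)
print(tokens[:10])
```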

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach

1 code implementation · 11 Sep 2023 · Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam

In addition, these findings point to the potential of using LLMs to improve speaker diarization and other speech processing tasks by capturing semantic and contextual cues.

Speaker Diarization
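
A heavily hedged sketch of the general idea above: interpolate an acoustic diarization score with a language-model score over speaker-tagged text, so that lexical context can influence the selected hypothesis. The hypotheses, scoring function, and weight below are all illustrative, not the paper's contextual beam search.

```python
def pick_hypothesis(beams, llm_score, alpha=0.3):
    # beams: list of (speaker-tagged transcript, acoustic log-score).
    # Choose the hypothesis with the best interpolated score.
    return max(beams, key=lambda b: (1 - alpha) * b[1] + alpha * llm_score(b[0]))

beams = [
    ("[spk0] how are you [spk1] fine thanks", -2.1),
    ("[spk0] how are you fine thanks", -1.9),
]
# Toy LLM score favoring a speaker change after a question.
best = pick_hypothesis(beams, llm_score=lambda h: 0.5 if "[spk1]" in h else 0.0)
print(best[0])  # the two-speaker hypothesis wins despite a worse acoustic score
```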

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

no code implementations · 13 Jul 2023 · He Huang, Jagadeesh Balam, Boris Ginsburg

We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.

Intent Classification +7
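
A minimal, hypothetical sketch of the warm-start pattern described above: the tiny encoder stands in for the pretrained Conformer, the decoder is reduced to a linear intent head, and only the encoder weight-loading step is the point.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    # Stand-in for the ASR-pretrained Conformer encoder.
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, hidden)
    def forward(self, x):
        return torch.relu(self.proj(x))

class SICSFModel(nn.Module):
    # Stand-in for the E2E model; the Transformer decoder is simplified
    # to a pooled intent classification head.
    def __init__(self, n_intents=60, hidden=128):
        super().__init__()
        self.encoder = TinyEncoder(hidden=hidden)
        self.intent_head = nn.Linear(hidden, n_intents)
    def forward(self, feats):
        return self.intent_head(self.encoder(feats).mean(dim=1))

asr_encoder = TinyEncoder()                               # pretend: ASR-pretrained
model = SICSFModel()
model.encoder.load_state_dict(asr_encoder.state_dict())   # warm start from ASR
logits = model(torch.randn(2, 50, 80))                    # (batch, frames, mels)
print(logits.shape)                                       # torch.Size([2, 60])
```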

A Compact End-to-End Model with Local and Global Context for Spoken Language Identification

no code implementations · 27 Oct 2022 · Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture.

Language Identification, Spoken Language Identification

Multi-scale Speaker Diarization with Dynamic Scale Weighting

no code implementations · 30 Mar 2022 · Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

First, we use multi-scale clustering as an initialization to estimate the number of speakers and obtain the average speaker representation vector for each speaker and each scale.

Decoder, Speaker Diarization +1
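
A hedged sketch of the initialization step quoted above: cluster segment embeddings once, then average embeddings per speaker at each scale to obtain per-scale speaker representation vectors. The scales, toy data, and clustering choice are illustrative, and the sketch reuses one set of labels across scales rather than the paper's cross-scale segment mapping.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# embeddings[s]: (segments_at_scale_s, dim); toy data with aligned segments.
embeddings = {s: rng.normal(size=(40, 192)) for s in (0.5, 1.0, 1.5)}

# Cluster at the finest scale to estimate speakers and assign labels.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings[0.5])

# Average speaker representation vector per speaker and per scale.
avg = {
    scale: np.stack([emb[labels == spk].mean(axis=0)
                     for spk in np.unique(labels)])
    for scale, emb in embeddings.items()
}
print({scale: vecs.shape for scale, vecs in avg.items()})  # (2, 192) per scale
```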

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

1 code implementation · 5 Apr 2021 · Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.

Speech Recognition
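
A toy illustration of the convention the abstract contrasts with: the acoustic model is trained on uncased, unpunctuated targets, so capitalization, punctuation, and non-standard words must be imputed later by separate post-processing models. The trivial normalizer below is a stand-in showing how much formatting such targets discard.

```python
import re

def normalize(transcript: str) -> str:
    # What a conventional acoustic model is trained to produce:
    # lowercase letters, apostrophes, and spaces only.
    return re.sub(r"[^a-z' ]", "", transcript.lower())

formatted = "Revenue grew 12% to $4.5 million in Q3."
print(normalize(formatted))
# -> "revenue grew  to  million in q"
```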
