Search Results for author: Zhehuai Chen

Found 27 papers, 2 papers with code

Chain-of-Thought Prompting for Speech Translation

no code implementations • 17 Sep 2024 • Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST).

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

no code implementations • 28 Jun 2024 • Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

We propose BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities.

Decoder Language Modelling

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

no code implementations • 27 Jun 2024 • Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee

Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs).

Descriptive Instruction Following

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

no code implementations • 18 Jun 2024 • Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs.

Synthetic Data Generation

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

1 code implementation • 10 Feb 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result.

Machine Translation Speech-to-Speech Translation +1

High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

no code implementations • 8 Jan 2024 • Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of the audio of an utterance and the embedding of its corresponding textual transcript close together.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Understanding Shared Speech-Text Representations

no code implementations • 27 Apr 2023 • Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Accelerating RNN-T Training and Inference Using CTC guidance

no code implementations • 29 Oct 2022 • Yongqiang Wang, Zhehuai Chen, Chengjian Zheng, Yu Zhang, Wei Han, Parisa Haghani

We propose a novel method to accelerate training and inference process of recurrent neural network transducer (RNN-T) based on the guidance from a co-trained connectionist temporal classification (CTC) model.

Decoder

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

no code implementations • 18 Oct 2022 • Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%.

Representation Learning speech-recognition +2

JOIST: A Joint Speech and Text Streaming Model For ASR

no code implementations • 13 Oct 2022 • Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties compared to previous works.

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

no code implementations • 16 May 2022 • Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition.

Accented Speech Recognition Benchmarking +2

MAESTRO: Matched Speech Text Representations through Modality Matching

no code implementations • 7 Apr 2022 • Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information.

Language Modelling Self-Supervised Learning +3

Injecting Text in Self-Supervised Speech Pretraining

no code implementations • 27 Aug 2021 • Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text.

Contrastive Learning Language Modelling +2

End-to-end contextual speech recognition using class language models and a token passing decoder

no code implementations • 5 Dec 2018 • Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen

In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because it introduces significant mismatch between training and test data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting

no code implementations • 2 Aug 2018 • Zhehuai Chen, Yanmin Qian, Kai Yu

The few studies on sequence discriminative training for KWS are limited for fixed vocabulary or LVCSR based methods and have not been compared to the state-of-the-art deep learning based KWS approaches.

Keyword Spotting speech-recognition +1

Linguistic Search Optimization for Deep Learning Based LVCSR

no code implementations • 2 Aug 2018 • Zhehuai Chen

Recent advances in deep learning based large vocabulary continuous speech recognition (LVCSR) invoke growing demands in large scale speech transcription.

speech-recognition Speech Recognition

A GPU-based WFST Decoder with Exact Lattice Generation

no code implementations • 9 Apr 2018 • Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs).

Decoder Scheduling

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

no code implementations • 3 Mar 2018 • Zhehuai Chen, Qi Liu, Hao Li, Kai Yu

Finally, modules are integrated into an acoustics-to-word model (A2W) and jointly optimized using acoustic data to retain the advantage of sequence modeling.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition

no code implementations • 21 Jul 2017 • Zhehuai Chen, Jasha Droppo, Jinyu Li, Wayne Xiong

We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
