Search Results for author: Jinyu Li

Found 124 papers, 29 papers with code

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

no code implementations10 Apr 2024 Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers.

Dialogue Generation

WavLLM: Towards Robust and Adaptive Speech Large Language Model

no code implementations31 Mar 2024 Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach.

Language Modelling Large Language Model

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

no code implementations5 Mar 2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.

Quantization Speech Synthesis

Boosting Large Language Model for Speech Synthesis: An Empirical Study

no code implementations30 Dec 2023 Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei

In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.

Language Modelling Large Language Model +2

COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

no code implementations3 Nov 2023 Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

With fewer than 20M trainable parameters and as little as 450 hours of English speech data for SQA generation, COSMIC exhibits emergent instruction-following and in-context learning capabilities in speech-to-text tasks.

Domain Adaptation In-Context Learning +4

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

no code implementations23 Oct 2023 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur

The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in Dynamic Environments

1 code implementation23 Oct 2023 Jinyu Li, Xiaokun Pan, Gan Huang, Ziyang Zhang, Nan Wang, Hujun Bao, Guofeng Zhang

In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these two problems.

Enhanced Edge-Perceptual Guided Image Filtering

no code implementations16 Oct 2023 Jinyu Li

Due to the powerful edge-preserving ability and low computational complexity, Guided image filter (GIF) and its improved versions has been widely applied in computer vision and image processing.

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

no code implementations3 Oct 2023 Yiming Wang, Jinyu Li

In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure.

speech-recognition Speech Recognition

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

no code implementations15 Sep 2023 Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

DiariST: Streaming Speech Translation with Speaker Diarization

1 code implementation14 Sep 2023 Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion.

speaker-diarization Speaker Diarization +3

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

no code implementations14 Aug 2023 Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech.

Language Modelling Multi-Task Learning +2

Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text

no code implementations30 Jul 2023 Eric Sun, Jinyu Li, Jian Xue, Yifan Gong

When mixing 20, 000 hours augmented speech data generated by our method with 12, 500 hours original transcribed speech data for Italian Transformer transducer model pre-training, we achieve 8. 7% relative word error rate reduction.

Automatic Speech Recognition Data Augmentation +2

On decoder-only architecture for speech-to-text and large language model integration

no code implementations8 Jul 2023 Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language.

Language Modelling Large Language Model +1

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

no code implementations7 Jul 2023 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur

In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Accelerating Transducers through Adjacent Token Merging

no code implementations28 Jun 2023 Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

no code implementations28 Jun 2023 Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM).

Domain Adaptation Language Modelling +3

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

no code implementations31 May 2023 Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu

In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

no code implementations25 May 2023 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities.

Language Modelling Multi-Task Learning +3

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

1 code implementation CVPR 2023 Jinyu Li, Chenxu Luo, Xiaodong Yang

In order to deal with the sparse and unstructured raw point clouds, LiDAR based 3D object detection research mostly focuses on designing dedicated local point aggregators for fine-grained geometrical modeling.

3D Object Detection Object +1

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

no code implementations1 Mar 2023 Eric Sun, Jinyu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin, Yifan Gong

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference.

Language Identification

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

no code implementations22 Feb 2023 Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Sheng Zhao

To solve above limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems to improve the previous CSC model from two perspectives: Firstly, we incorporate acoustics information with an external attention as well as text hypotheses into CSC to better distinguish target phrase from dissimilar or irrelevant phrases.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Speaker Change Detection for Transformer Transducer ASR

no code implementations16 Feb 2023 Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

no code implementations21 Nov 2022 Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e. g., vision, text.

Audio-Visual Speech Recognition Language Modelling +3

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

no code implementations17 Nov 2022 Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Speech separation with large-scale self-supervised learning

no code implementations9 Nov 2022 Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released 94K hours trained WavLM, our proposed model obtains 15. 9% and 11. 2% of relative word error rate (WER) reductions, respectively, for a simulated far-field speech mixture test set.

Self-Supervised Learning Speech Separation

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

no code implementations4 Nov 2022 Jian Xue, Peidong Wang, Jinyu Li, Eric Sun

In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language.

Machine Translation speech-recognition +2

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

1 code implementation31 Oct 2022 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei

However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.

Speech-to-Speech Translation Translation

Simulating realistic speech overlaps improves multi-talker ASR

no code implementations27 Oct 2022 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka

Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Acoustic-aware Non-autoregressive Spell Correction with Mask Sample Decoding

no code implementations16 Oct 2022 Ruchao Fan, Guoli Ye, Yashesh Gaur, Jinyu Li

As a result, we reduce the WER of a streaming TT from 7. 6% to 6. 5% on the Librispeech test-other data and the CER from 7. 3% to 6. 1% on the Aishell test data, respectively.

Language Modelling speech-recognition +1

CTCBERT: Advancing Hidden-unit BERT with CTC Objectives

no code implementations16 Oct 2022 Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li

We examine CTCBERT on IDs from HuBERT Iter1, HuBERT Iter2, and PBERT.

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

1 code implementation30 Sep 2022 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, LiRong Dai, Jinyu Li, Furu Wei

In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.

Language Modelling speech-recognition +1

VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

no code implementations12 Sep 2022 Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka

To combine the best of both technologies, we newly design a t-SOT-based ASR model that generates a serialized multi-talker transcription based on two separated speech signals from VarArray.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

DeFlowSLAM: Self-Supervised Scene Motion Decomposition for Dynamic Dense SLAM

1 code implementation18 Jul 2022 Weicai Ye, Xingyuan Yu, Xinyue Lan, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and another dynamic flow field caused by the objects' movements in the scene.

Pose Estimation Simultaneous Localization and Mapping

Ultra Fast Speech Separation Model with Teacher Student Learning

no code implementations27 Apr 2022 Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning).

Computational Efficiency Speech Separation

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

no code implementations27 Apr 2022 Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition.

Self-Supervised Learning Speaker Recognition +3

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

1 code implementation30 Mar 2022 Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems

1 code implementation2 Mar 2022 Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Veljko Miljanic, Sheng Zhao, Hosam Khalil

In this work, we introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

1 code implementation2 Feb 2022 Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Endpoint Detection for Streaming End-to-End Multi-talker ASR

no code implementations24 Jan 2022 Liang Lu, Jinyu Li, Yifan Gong

Our experimental results based on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significantly degradation of the recognition accuracy.

Sentence speech-recognition +2

Self-Supervised Learning for speech recognition with Intermediate layer supervision

1 code implementation16 Dec 2021 Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.

Language Modelling Self-Supervised Learning +2

Recent Advances in End-to-End Automatic Speech Recognition

no code implementations2 Nov 2021 Jinyu Li

Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Continuous Speech Separation with Recurrent Selective Attention Network

no code implementations28 Oct 2021 Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li

In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting.

speech-recognition Speech Recognition +1

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

3 code implementations ACL 2022 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

no code implementations11 Oct 2021 Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition

no code implementations10 Oct 2021 Guoli Ye, Vadim Mazalov, Jinyu Li, Yifan Gong

Hybrid and end-to-end (E2E) systems have their individual advantages, with different error patterns in the speech recognition results.

speech-recognition Speech Recognition

Factorized Neural Transducer for Efficient Language Model Adaptation

1 code implementation27 Sep 2021 Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li

In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

no code implementations17 Sep 2021 Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li

Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions.

Speech Separation

A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems

no code implementations17 Aug 2021 Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li

We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

A Configurable Multilingual Model is All You Need to Recognize All Languages

no code implementations13 Jul 2021 Long Zhou, Jinyu Li, Eric Sun, Shujie Liu

Particularly, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

no code implementations12 Jul 2021 Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei

Recently, there has been a vast interest in self-supervised learning (SSL) where the model is pre-trained on large scale unlabeled data and then fine-tuned on a small labeled dataset.

Self-Supervised Learning speech-recognition +1

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

no code implementations5 Jul 2021 Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

no code implementations4 Jun 2021 Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.

Language Modelling speech-recognition +1

On Addressing Practical Challenges for RNN-Transducer

no code implementations27 Apr 2021 Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.

speech-recognition Speech Recognition

Streaming Multi-talker Speech Recognition with Joint Speaker Identification

no code implementations5 Apr 2021 Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications.

Speaker Identification speech-recognition +2

Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

no code implementations2 Feb 2021 Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Streaming end-to-end multi-talker speech recognition

no code implementations26 Nov 2020 Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcriptions.

speech-recognition Speech Recognition

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

no code implementations3 Nov 2020 Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

no code implementations23 Oct 2020 Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of the external language model (LM) fusion.

Language Modelling speech-recognition +1

Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

1 code implementation23 Oct 2020 Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li

With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently.

Speech Separation

Speaker Separation Using Speaker Inventories and Estimated Speech

no code implementations20 Oct 2020 Peidong Wang, Zhuo Chen, DeLiang Wang, Jinyu Li, Yifan Gong

We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation.

Speaker Separation Speech Extraction +2

An End-to-end Architecture of Online Multi-channel Speech Separation

no code implementations7 Sep 2020 Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie

Previously, we introduced a sys-tem, calledunmixing, fixed-beamformerandextraction(UFE), that was shown to be effective in addressing the speech over-lap problem in conversation transcription.

speech-recognition Speech Recognition +1

Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

1 code implementation14 Aug 2020 Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation.

Data Augmentation Domain Adaptation +2

Continuous Speech Separation with Conformer

1 code implementation13 Aug 2020 Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.

 Ranked #1 on Speech Separation on LibriCSS (using extra training data)

Speech Separation

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

no code implementations12 Aug 2020 Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li

Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) system, to transfer the knowledge from source to target language.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

no code implementations30 Jul 2020 Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Exploring Transformers for Large-Scale Speech Recognition

no code implementations19 May 2020 Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong

While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has been proven to be a competitive alternative, especially in the offline condition.

speech-recognition Speech Recognition

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

no code implementations1 May 2020 Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its advantages of being capable for online streaming speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

L-Vector: Neural Label Embedding for Domain Adaptation

no code implementations25 Apr 2020 Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee

We propose a novel neural label embedding (NLE) scheme for the domain adaptation of a deep neural network (DNN) acoustic model with unpaired data samples from source and target domains.

Domain Adaptation

Low Latency End-to-End Streaming Speech Recognition with a Scout Network

no code implementations23 Mar 2020 Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode.

Audio and Speech Processing

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

no code implementations17 Mar 2020 Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Continuous speech separation: dataset and analysis

1 code implementation30 Jan 2020 Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a \textit{continuous} audio stream that contains multiple utterances that are \emph{partially} overlapped by a varying degree.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

no code implementations6 Jan 2020 Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

In this work, we extend the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder guidance.

speech-recognition Speech Recognition +2

Character-Aware Attention-Based End-to-End Speech Recognition

no code implementations6 Jan 2020 Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion.

speech-recognition Speech Recognition

Semantic Mask for Transformer based End-to-End Speech Recognition

1 code implementation6 Dec 2019 Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

no code implementations9 Nov 2019 Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Adversarial Speaker Verification

no code implementations29 Apr 2019 Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong

The use of deep networks to extract embeddings for speaker recognition has proven successfully.

General Classification Speaker Recognition +1

Adversarial Speaker Adaptation

no code implementations29 Apr 2019 Zhong Meng, Jinyu Li, Yifan Gong

We propose a novel adversarial speaker adaptation (ASA) scheme, in which adversarial learning is applied to regularize the distribution of deep hidden features in a speaker-dependent (SD) deep neural network (DNN) acoustic model to be close to that of a fixed speaker-independent (SI) DNN acoustic model during adaptation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Conditional Teacher-Student Learning

no code implementations28 Apr 2019 Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong

To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth.

Domain Adaptation Model Compression

Attentive Adversarial Learning for Domain-Invariant Training

no code implementations28 Apr 2019 Zhong Meng, Jinyu Li, Yifan Gong

Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Speaker Adaptation for End-to-End CTC Models

no code implementations4 Jan 2019 Ke Li, Jinyu Li, Yong Zhao, Kshitiz Kumar, Yifan Gong

We propose two approaches for speaker adaptation in end-to-end (E2E) automatic speech recognition systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Advancing Acoustic-to-Word CTC Model with Attention and Mixed-Units

no code implementations31 Dec 2018 Amit Das, Jinyu Li, Guoli Ye, Rui Zhao, Yifan Gong

In particular, we introduce Attention CTC, Self-Attention CTC, Hybrid CTC, and Mixed-unit CTC.

Language Modelling

Cycle-Consistent Speech Enhancement

no code implementations6 Sep 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

In this paper, we propose a cycle-consistent speech enhancement (CSE) in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones.

Multi-Task Learning Speech Enhancement

Adversarial Feature-Mapping for Speech Enhancement

no code implementations6 Sep 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

To achieve better performance on ASR task, senone-aware (SA) AFM is further proposed in which an acoustic model network is jointly trained with the feature-mapping and discriminator networks to optimize the senone classification loss in addition to the AFM losses.

Speech Enhancement

Layer Trajectory LSTM

no code implementations28 Aug 2018 Jinyu Li, Changliang Liu, Yifan Gong

In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM.

Recent Progresses in Deep Learning based Acoustic Models (Updated)

no code implementations25 Apr 2018 Dong Yu, Jinyu Li

In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques.

General Classification Speech Enhancement +2

Developing Far-Field Speaker System Via Teacher-Student Learning

no code implementations14 Apr 2018 Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, Yifan Gong

In this study, we develop the keyword spotting (KWS) and acoustic model (AM) components in a far-field speaker system.

Keyword Spotting Model Compression

Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation

no code implementations2 Apr 2018 Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

In this method, a student acoustic model and a condition classifier are jointly optimized to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models, and simultaneously, to min-maximize the condition classification loss.

Transfer Learning Unsupervised Domain Adaptation

Speaker-Invariant Training via Adversarial Learning

no code implementations2 Apr 2018 Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang, Juang

We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system.

General Classification Multi-Task Learning

Advancing Connectionist Temporal Classification With Attention Modeling

no code implementations15 Mar 2018 Amit Das, Jinyu Li, Rui Zhao, Yifan Gong

In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework.

Classification General Classification +3

Advancing Acoustic-to-Word CTC Model

no code implementations15 Mar 2018 Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, Yifan Gong

However, the word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node.

Language Modelling

Acoustic-To-Word Model Without OOV

no code implementations28 Nov 2017 Jinyu Li, Guoli Ye, Rui Zhao, Jasha Droppo, Yifan Gong

However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node.

Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition

no code implementations21 Nov 2017 Zhong Meng, Zhuo Chen, Vadim Mazalov, Jinyu Li, Yifan Gong

Unsupervised domain adaptation of speech signal aims at adapting a well-trained source-domain acoustic model to the unlabeled data from target domain.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Improved training for online end-to-end speech recognition systems

1 code implementation6 Nov 2017 Suyoun Kim, Michael L. Seltzer, Jinyu Li, Rui Zhao

Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.

speech-recognition Speech Recognition

Large-Scale Domain Adaptation via Teacher-Student Learning

no code implementations17 Aug 2017 Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong

High accuracy speech recognition requires a large amount of transcribed data for supervised training.

Domain Adaptation speech-recognition +1

Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition

no code implementations21 Jul 2017 Zhehuai Chen, Jasha Droppo, Jinyu Li, Wayne Xiong

We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

End-to-End Attention based Text-Dependent Speaker Verification

no code implementations3 Jan 2017 Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Yifan Gong

A new type of End-to-End system for text-dependent speaker verification is presented in this paper.

Text-Dependent Speaker Verification

Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

no code implementations12 Jan 2016 Pawel Swietojanski, Jinyu Li, Steve Renals

This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data.

speech-recognition Speech Recognition

Cannot find the paper you are looking for? You can Submit a new open access paper.