Search Results for author: Jinyu Li

Found 124 papers, 29 papers with code

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

no code implementations • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers.

Dialogue Generation

Paper
Add Code

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code implementations • 4 Apr 2024 • Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.

Language Modelling Speech Synthesis +1

Paper
Add Code

WavLLM: Towards Robust and Adaptive Speech Large Language Model

no code implementations • 31 Mar 2024 • Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach.

Language Modelling Large Language Model

Paper
Add Code

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

no code implementations • 5 Mar 2024 • Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.

Quantization Speech Synthesis

Paper
Add Code

Boosting Large Language Model for Speech Synthesis: An Empirical Study

no code implementations • 30 Dec 2023 • Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei

In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.

Language Modelling Large Language Model +2

Paper
Add Code

COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

no code implementations • 3 Nov 2023 • Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

With fewer than 20M trainable parameters and as little as 450 hours of English speech data for SQA generation, COSMIC exhibits emergent instruction-following and in-context learning capabilities in speech-to-text tasks.

Domain Adaptation In-Context Learning +4

Paper
Add Code

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

no code implementations • 23 Oct 2023 • Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur

The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in Dynamic Environments

1 code implementation • 23 Oct 2023 • Jinyu Li, Xiaokun Pan, Gan Huang, Ziyang Zhang, Nan Wang, Hujun Bao, Guofeng Zhang

In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these two problems.

367

Paper
Code

Enhanced Edge-Perceptual Guided Image Filtering

no code implementations • 16 Oct 2023 • Jinyu Li

Due to the powerful edge-preserving ability and low computational complexity, Guided image filter (GIF) and its improved versions has been widely applied in computer vision and image processing.

Paper
Add Code

Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

no code implementations • 6 Oct 2023 • Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li

Simultaneous Speech-to-Text translation serves a critical role in real-time crosslingual communication.

Simultaneous Speech-to-Text Translation Translation

Paper
Add Code

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

no code implementations • 3 Oct 2023 • Yiming Wang, Jinyu Li

In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure.

speech-recognition Speech Recognition

Paper
Add Code

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

no code implementations • 15 Sep 2023 • Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

DiariST: Streaming Speech Translation with Speaker Diarization

1 code implementation • 14 Sep 2023 • Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion.

speaker-diarization Speaker Diarization +3

Paper
Code

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

no code implementations • 14 Aug 2023 • Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech.

Language Modelling Multi-Task Learning +2

Paper
Add Code

Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text

no code implementations • 30 Jul 2023 • Eric Sun, Jinyu Li, Jian Xue, Yifan Gong

When mixing 20, 000 hours augmented speech data generated by our method with 12, 500 hours original transcribed speech data for Italian Transformer transducer model pre-training, we achieve 8. 7% relative word error rate reduction.

Automatic Speech Recognition Data Augmentation +2

Paper
Add Code

On decoder-only architecture for speech-to-text and large language model integration

no code implementations • 8 Jul 2023 • Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language.

Language Modelling Large Language Model +1

Paper
Add Code

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

no code implementations • 7 Jul 2023 • Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur

In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Accelerating Transducers through Adjacent Token Merging

no code implementations • 28 Jun 2023 • Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

no code implementations • 28 Jun 2023 • Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM).

Domain Adaptation Language Modelling +3

Paper
Add Code

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

no code implementations • 31 May 2023 • Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu

In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

no code implementations • 25 May 2023 • Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities.

Language Modelling Multi-Task Learning +3

Paper
Add Code

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

1 code implementation • CVPR 2023 • Jinyu Li, Chenxu Luo, Xiaodong Yang

In order to deal with the sparse and unstructured raw point clouds, LiDAR based 3D object detection research mostly focuses on designing dedicated local point aggregators for fine-grained geometrical modeling.

Ranked #1 on 3D Object Detection on waymo vehicle

3D Object Detection Object +1

155

Paper
Code

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

1 code implementation • 7 Mar 2023 • Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.

In-Context Learning Language Modelling +3

7,138

Paper
Code

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

no code implementations • 1 Mar 2023 • Eric Sun, Jinyu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin, Yifan Gong

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference.

Language Identification

Paper
Add Code

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

no code implementations • 22 Feb 2023 • Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Sheng Zhao

To solve above limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems to improve the previous CSC model from two perspectives: Firstly, we incorporate acoustics information with an external attention as well as text hypotheses into CSC to better distinguish target phrase from dissimilar or irrelevant phrases.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Speaker Change Detection for Transformer Transducer ASR

no code implementations • 16 Feb 2023 • Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

6 code implementations • 5 Jan 2023 • Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.

In-Context Learning Language Modelling +2

32,377

Paper
Code

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

no code implementations • 5 Dec 2022 • Rui Zhao, Jian Xue, Partha Parthasarathy, Veljko Miljanic, Jinyu Li

Neural transducer is now the most popular end-to-end model for speech recognition, due to its naturally streaming ability.

Language Modelling speech-recognition +1

Paper
Add Code

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

no code implementations • 21 Nov 2022 • Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e. g., vision, text.

Audio-Visual Speech Recognition Language Modelling +3

Paper
Add Code

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

no code implementations • 17 Nov 2022 • Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Self-supervised learning with bi-label masked speech prediction for streaming multi-talker speech recognition

no code implementations • 10 Nov 2022 • Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang

In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion.

Representation Learning Self-Supervised Learning +2

Paper
Add Code

Speech separation with large-scale self-supervised learning

no code implementations • 9 Nov 2022 • Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released 94K hours trained WavLM, our proposed model obtains 15. 9% and 11. 2% of relative word error rate (WER) reductions, respectively, for a simulated far-field speech mixture test set.

Self-Supervised Learning Speech Separation

Paper
Add Code

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

no code implementations • 7 Nov 2022 • Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, Jinyu Li, Yifan Gong

Automatic Speech Recognition (ASR) systems typically yield output in lexical form.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

no code implementations • 5 Nov 2022 • Peidong Wang, Eric Sun, Jian Xue, Yu Wu, Long Zhou, Yashesh Gaur, Shujie Liu, Jinyu Li

In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

no code implementations • 4 Nov 2022 • Jian Xue, Peidong Wang, Jinyu Li, Eric Sun

In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language.

Machine Translation speech-recognition +2

Paper
Add Code

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

1 code implementation • 31 Oct 2022 • Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei

However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.

Speech-to-Speech Translation Translation

1,005

Paper
Code

Simulating realistic speech overlaps improves multi-talker ASR

no code implementations • 27 Oct 2022 • Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka

Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Acoustic-aware Non-autoregressive Spell Correction with Mask Sample Decoding

no code implementations • 16 Oct 2022 • Ruchao Fan, Guoli Ye, Yashesh Gaur, Jinyu Li

As a result, we reduce the WER of a streaming TT from 7. 6% to 6. 5% on the Librispeech test-other data and the CER from 7. 3% to 6. 1% on the Aishell test data, respectively.

Language Modelling speech-recognition +1

Paper
Add Code

CTCBERT: Advancing Hidden-unit BERT with CTC Objectives

no code implementations • 16 Oct 2022 • Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li

We examine CTCBERT on IDs from HuBERT Iter1, HuBERT Iter2, and PBERT.

Paper
Add Code

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

1 code implementation • 7 Oct 2022 • Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, LiRong Dai, Jinyu Li, Furu Wei

The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

1,005

Paper
Code

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

1 code implementation • 30 Sep 2022 • Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, LiRong Dai, Jinyu Li, Furu Wei

In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.

Language Modelling speech-recognition +1

1,005

Paper
Code

VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

no code implementations • 12 Sep 2022 • Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka

To combine the best of both technologies, we newly design a t-SOT-based ASR model that generates a serialized multi-talker transcription based on two separated speech signals from VarArray.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Tower Bridge Net (TB-Net): Bidirectional Knowledge Graph Aware Embedding Propagation for Explainable Recommender Systems

2 code implementations • International Conference on Data Engineering 2022 • Shendi Wang, Haoyang Li, Caleb Chen Cao, Xiao-Hui Li, Ng Ngai Fai, Jianxin Liu, Xun Xue, Hu Song, Jinyu Li, Guangye Gu, Lei Chen

Recently, neural networks based models have been widely used for recommender systems (RS).

Explainable Recommendation Recommendation Systems

Paper
Code

DeFlowSLAM: Self-Supervised Scene Motion Decomposition for Dynamic Dense SLAM

1 code implementation • 18 Jul 2022 • Weicai Ye, Xingyuan Yu, Xinyue Lan, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and another dynamic flow field caused by the objects' movements in the scene.

Pose Estimation Simultaneous Localization and Mapping

109

Paper
Code

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

no code implementations • 21 Jun 2022 • Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

1 code implementation • 12 Jun 2022 • Ziqiang Zhang, Junyi Ao, Long Zhou, Shujie Liu, Furu Wei, Jinyu Li

The YiTrans system is built on large-scale pre-trained encoder-decoder models.

Data Augmentation Translation

1,005

Paper
Code

Ultra Fast Speech Separation Model with Teacher Student Learning

no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning).

Computational Efficiency Speech Separation

Paper
Add Code

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition.

Self-Supervised Learning Speaker Recognition +3

Paper
Add Code

Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

1 code implementation • 11 Apr 2022 • Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

Neural transducers have been widely used in automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Code

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei

In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

1,005

Paper
Code

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

1 code implementation • 30 Mar 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Code

Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems

1 code implementation • 2 Mar 2022 • Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Veljko Miljanic, Sheng Zhao, Hosam Khalil

In this work, we introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Code

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

1 code implementation • 2 Feb 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

Endpoint Detection for Streaming End-to-End Multi-talker ASR

no code implementations • 24 Jan 2022 • Liang Lu, Jinyu Li, Yifan Gong

Our experimental results based on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significantly degradation of the recognition accuracy.

Sentence speech-recognition +2

Paper
Add Code

Self-Supervised Learning for speech recognition with Intermediate layer supervision

1 code implementation • 16 Dec 2021 • Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.

Language Modelling Self-Supervised Learning +2

387

Paper
Code

Sequence-level self-learning with multiple hypotheses

no code implementations • 10 Dec 2021 • Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

For untranscribed speech data, the hypothesis from an ASR system must be used as a label.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Recent Advances in End-to-End Automatic Speech Recognition

no code implementations • 2 Nov 2021 • Jinyu Li

Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

no code implementations • 28 Oct 2021 • Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie Liu, Takuya Yoshioka, Jinyu Li, DeLiang Wang

The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +8

Paper
Add Code

Continuous Speech Separation with Recurrent Selective Attention Network

no code implementations • 28 Oct 2021 • Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li

In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting.

speech-recognition Speech Recognition +1

Paper
Add Code

Separating Long-Form Speech with Group-Wise Permutation Invariant Training

no code implementations • 27 Oct 2021 • Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription.

Speech Separation

Paper
Add Code

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

5 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.

Denoising Self-Supervised Learning +3

18,274

Paper
Code

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

3 code implementations • ACL 2022 • Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

124,527

Paper
Code

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training

3 code implementations • 12 Oct 2021 • Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu

We integrate the proposed methods into the HuBERT framework.

Data Augmentation Multi-Task Learning +5

387

Paper
Code

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

no code implementations • 11 Oct 2021 • Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

Paper
Add Code

Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition

no code implementations • 10 Oct 2021 • Guoli Ye, Vadim Mazalov, Jinyu Li, Yifan Gong

Hybrid and end-to-end (E2E) systems have their individual advantages, with different error patterns in the speech recognition results.

speech-recognition Speech Recognition

Paper
Add Code

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

no code implementations • 6 Oct 2021 • Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Factorized Neural Transducer for Efficient Language Model Adaptation

1 code implementation • 27 Sep 2021 • Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li

In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Code

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

no code implementations • 17 Sep 2021 • Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li

Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions.

Speech Separation

Paper
Add Code

A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems

no code implementations • 17 Aug 2021 • Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li

We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

A Configurable Multilingual Model is All You Need to Recognize All Languages

no code implementations • 13 Jul 2021 • Long Zhou, Jinyu Li, Eric Sun, Shujie Liu

Particularly, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

no code implementations • 12 Jul 2021 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei

Recently, there has been a vast interest in self-supervised learning (SSL) where the model is pre-trained on large scale unlabeled data and then fine-tuned on a small labeled dataset.

Self-Supervised Learning speech-recognition +1

Paper
Add Code

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

no code implementations • 5 Jul 2021 • Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

no code implementations • 4 Jun 2021 • Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.

Language Modelling speech-recognition +1

Paper
Add Code

On Addressing Practical Challenges for RNN-Transducer

no code implementations • 27 Apr 2021 • Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.

speech-recognition Speech Recognition

Paper
Add Code

Streaming Multi-talker Speech Recognition with Joint Speaker Identification

no code implementations • 5 Apr 2021 • Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications.

Speaker Identification speech-recognition +2

Paper
Add Code

Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

no code implementations • 2 Feb 2021 • Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Streaming end-to-end multi-talker speech recognition

no code implementations • 26 Nov 2020 • Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcriptions.

speech-recognition Speech Recognition

Paper
Add Code

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

no code implementations • 3 Nov 2020 • Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

no code implementations • 23 Oct 2020 • Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of the external language model (LM) fusion.

Language Modelling speech-recognition +1

Paper
Add Code

Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

1 code implementation • 23 Oct 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li

With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently.

Speech Separation

Paper
Code

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

no code implementations • 22 Oct 2020 • Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition.

speech-recognition Speech Recognition

Paper
Add Code

Speaker Separation Using Speaker Inventories and Estimated Speech

no code implementations • 20 Oct 2020 • Peidong Wang, Zhuo Chen, DeLiang Wang, Jinyu Li, Yifan Gong

We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation.

Speaker Separation Speech Extraction +2

Paper
Add Code

An End-to-end Architecture of Online Multi-channel Speech Separation

no code implementations • 7 Sep 2020 • Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie

Previously, we introduced a sys-tem, calledunmixing, fixed-beamformerandextraction(UFE), that was shown to be effective in addressing the speech over-lap problem in conversation transcription.

speech-recognition Speech Recognition +1

Paper
Add Code

Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

1 code implementation • 14 Aug 2020 • Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation.

Data Augmentation Domain Adaptation +2

Paper
Code

Continuous Speech Separation with Conformer

1 code implementation • 13 Aug 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.

Ranked #1 on Speech Separation on LibriCSS (using extra training data)

Speech Separation

103

Paper
Code

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

no code implementations • 12 Aug 2020 • Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li

Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) system, to transfer the knowledge from source to target language.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

no code implementations • 30 Jul 2020 • Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Production de la parole en r\'eponse \`a de multiples perturbations du feedback auditif (Speech production in response to multiple perturbations of auditory feedback)

no code implementations • JEPTALNRECITAL 2020 • Jinyu Li, Leonardo Lancia

Des {\'e}tudes ant{\'e}rieures ont montr{\'e} que la production de la parole d{\'e}pend des conditions du feedback auditif.

Paper
Add Code

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

1 code implementation • 28 May 2020 • Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming mode.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

267

Paper
Code

Exploring Transformers for Large-Scale Speech Recognition

no code implementations • 19 May 2020 • Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong

While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has been proven to be a competitive alternative, especially in the offline condition.

speech-recognition Speech Recognition

Paper
Add Code

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

no code implementations • 1 May 2020 • Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its advantages of being capable for online streaming speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

L-Vector: Neural Label Embedding for Domain Adaptation

no code implementations • 25 Apr 2020 • Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee

We propose a novel neural label embedding (NLE) scheme for the domain adaptation of a deep neural network (DNN) acoustic model with unpaired data samples from source and target domains.

Domain Adaptation

Paper
Add Code

Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

no code implementations • 10 Apr 2020 • Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

This leads to an inevitable latency during inference.

Multi-Task Learning speech-recognition +1

Paper
Add Code

Low Latency End-to-End Streaming Speech Recognition with a Scout Network

no code implementations • 23 Mar 2020 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode.

Audio and Speech Processing

Paper
Add Code

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

no code implementations • 17 Mar 2020 • Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Continuous speech separation: dataset and analysis

1 code implementation • 30 Jan 2020 • Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a \textit{continuous} audio stream that contains multiple utterances that are \emph{partially} overlapped by a varying degree.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

129

Paper
Code

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

no code implementations • 6 Jan 2020 • Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

In this work, we extend the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder guidance.

speech-recognition Speech Recognition +2

Paper
Add Code

Character-Aware Attention-Based End-to-End Speech Recognition

no code implementations • 6 Jan 2020 • Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion.

speech-recognition Speech Recognition

Paper
Add Code

Semantic Mask for Transformer based End-to-End Speech Recognition

1 code implementation • 6 Dec 2019 • Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Code

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

no code implementations • 9 Nov 2019 • Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Improving RNN Transducer Modeling for End-to-End Speech Recognition

1 code implementation • 26 Sep 2019 • Jinyu Li, Rui Zhao, Hu Hu, Yifan Gong

In this paper, we improve the RNN-T training in two aspects.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Code

Adversarial Speaker Verification

no code implementations • 29 Apr 2019 • Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong

The use of deep networks to extract embeddings for speaker recognition has proven successfully.

General Classification Speaker Recognition +1

Paper
Add Code

Adversarial Speaker Adaptation

no code implementations • 29 Apr 2019 • Zhong Meng, Jinyu Li, Yifan Gong

We propose a novel adversarial speaker adaptation (ASA) scheme, in which adversarial learning is applied to regularize the distribution of deep hidden features in a speaker-dependent (SD) deep neural network (DNN) acoustic model to be close to that of a fixed speaker-independent (SI) DNN acoustic model during adaptation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Conditional Teacher-Student Learning

no code implementations • 28 Apr 2019 • Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong

To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth.

Domain Adaptation Model Compression

Paper
Add Code

Attentive Adversarial Learning for Domain-Invariant Training

no code implementations • 28 Apr 2019 • Zhong Meng, Jinyu Li, Yifan Gong

Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Speaker Adaptation for End-to-End CTC Models

no code implementations • 4 Jan 2019 • Ke Li, Jinyu Li, Yong Zhao, Kshitiz Kumar, Yifan Gong

We propose two approaches for speaker adaptation in end-to-end (E2E) automatic speech recognition systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Advancing Acoustic-to-Word CTC Model with Attention and Mixed-Units

no code implementations • 31 Dec 2018 • Amit Das, Jinyu Li, Guoli Ye, Rui Zhao, Yifan Gong

In particular, we introduce Attention CTC, Self-Attention CTC, Hybrid CTC, and Mixed-unit CTC.

Language Modelling

Paper
Add Code

Cycle-Consistent Speech Enhancement

no code implementations • 6 Sep 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

In this paper, we propose a cycle-consistent speech enhancement (CSE) in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones.

Multi-Task Learning Speech Enhancement

Paper
Add Code

Adversarial Feature-Mapping for Speech Enhancement

no code implementations • 6 Sep 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

To achieve better performance on ASR task, senone-aware (SA) AFM is further proposed in which an acoustic model network is jointly trained with the feature-mapping and discriminator networks to optimize the senone classification loss in addition to the AFM losses.

Speech Enhancement

Paper
Add Code

Layer Trajectory LSTM

no code implementations • 28 Aug 2018 • Jinyu Li, Changliang Liu, Yifan Gong

In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM.

Paper
Add Code

Recent Progresses in Deep Learning based Acoustic Models (Updated)

no code implementations • 25 Apr 2018 • Dong Yu, Jinyu Li

In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques.

General Classification Speech Enhancement +2

Paper
Add Code

Developing Far-Field Speaker System Via Teacher-Student Learning

no code implementations • 14 Apr 2018 • Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, Yifan Gong

In this study, we develop the keyword spotting (KWS) and acoustic model (AM) components in a far-field speaker system.

Keyword Spotting Model Compression

Paper
Add Code

Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation

no code implementations • 2 Apr 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

In this method, a student acoustic model and a condition classifier are jointly optimized to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models, and simultaneously, to min-maximize the condition classification loss.

Transfer Learning Unsupervised Domain Adaptation

Paper
Add Code

Speaker-Invariant Training via Adversarial Learning

no code implementations • 2 Apr 2018 • Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang, Juang

We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system.

General Classification Multi-Task Learning

Paper
Add Code

Advancing Connectionist Temporal Classification With Attention Modeling

no code implementations • 15 Mar 2018 • Amit Das, Jinyu Li, Rui Zhao, Yifan Gong

In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework.

Classification General Classification +3

Paper
Add Code

Advancing Acoustic-to-Word CTC Model

no code implementations • 15 Mar 2018 • Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, Yifan Gong

However, the word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node.

Language Modelling

Paper
Add Code

Acoustic-To-Word Model Without OOV

no code implementations • 28 Nov 2017 • Jinyu Li, Guoli Ye, Rui Zhao, Jasha Droppo, Yifan Gong

However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node.

Paper
Add Code

Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition

no code implementations • 21 Nov 2017 • Zhong Meng, Zhuo Chen, Vadim Mazalov, Jinyu Li, Yifan Gong

Unsupervised domain adaptation of speech signal aims at adapting a well-trained source-domain acoustic model to the unlabeled data from target domain.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

Improved training for online end-to-end speech recognition systems

1 code implementation • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer, Jinyu Li, Rui Zhao

Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.

speech-recognition Speech Recognition

Paper
Code

Large-Scale Domain Adaptation via Teacher-Student Learning

no code implementations • 17 Aug 2017 • Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong

High accuracy speech recognition requires a large amount of transcribed data for supervised training.

Domain Adaptation speech-recognition +1

Paper
Add Code

Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition

no code implementations • 21 Jul 2017 • Zhehuai Chen, Jasha Droppo, Jinyu Li, Wayne Xiong

We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

End-to-End Attention based Text-Dependent Speaker Verification

no code implementations • 3 Jan 2017 • Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Yifan Gong

A new type of End-to-End system for text-dependent speaker verification is presented in this paper.

Text-Dependent Speaker Verification

Paper
Add Code

Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

no code implementations • 12 Jan 2016 • Pawel Swietojanski, Jinyu Li, Steve Renals

This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data.

speech-recognition Speech Recognition

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.