no code implementations • 10 Jan 2025 • Shubham Bansal, Vikas Joshi, Harveen Chadha, Rupeshkumar Mehta, Jinyu Li
This study addresses the issue of speaker gender bias in Speech Translation (ST) systems, which can lead to offensive and inaccurate translations.
no code implementations • 20 Dec 2024 • Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen
Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality.
no code implementations • 20 Dec 2024 • Yifan Yang, Ziyang Ma, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ruiyang Xu, Yuxuan Hu, Yan Lu, Rui Zhao, Xie Chen
This paper introduces Interleaved Speech-Text Language Model (IST-LM) for streaming zero-shot Text-to-Speech (TTS).
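The interleaving idea can be pictured with a small, hypothetical sketch: text tokens and speech codec tokens are merged at a fixed ratio so that an autoregressive LM can begin emitting speech before the full text has been consumed. The 1:3 ratio, token names, and helper function below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: merge a text-token stream and a speech-token stream at
# a fixed ratio (here 1 text token followed by 3 speech tokens) into a single
# training sequence for an autoregressive LM.
def interleave(text_tokens, speech_tokens, n_text=1, n_speech=3):
    out, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        out.extend(text_tokens[t:t + n_text]); t += n_text
        out.extend(speech_tokens[s:s + n_speech]); s += n_speech
    return out

print(interleave(["T0", "T1", "T2"], ["S0", "S1", "S2", "S3", "S4", "S5"]))
# ['T0', 'S0', 'S1', 'S2', 'T1', 'S3', 'S4', 'S5', 'T2']
```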
no code implementations • 2 Dec 2024 • Ruchao Fan, Bo Ren, Yuxuan Hu, Rui Zhao, Shujie Liu, Jinyu Li
AlignFormer achieves a near-100% IFR with audio-first training, and improves IFR from zero to non-zero on some evaluation data with instruction-first training.
no code implementations • 29 Nov 2024 • Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
1 code implementation • 27 Nov 2024 • Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li
Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models.
no code implementations • 27 Oct 2024 • Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, Furu Wei
Text-to-video models have recently undergone rapid and substantial advancements.
no code implementations • 17 Oct 2024 • Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Automatic Speech Recognition (ASR) +3
no code implementations • 7 Oct 2024 • Rui Zhao, Jinyu Li, Ruchao Fan, Matt Post
Models for streaming speech translation (ST) can achieve high accuracy and low latency if they are developed with large amounts of paired audio in the source language and written text in the target language.
no code implementations • 20 Sep 2024 • Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan
In this work, we propose a new approach to estimate word boundaries without relying on lexicons.
1 code implementation • 17 Jul 2024 • Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda
This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker.
no code implementations • 11 Jul 2024 • Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS).
no code implementations • 12 Jun 2024 • Peidong Wang, Jian Xue, Jinyu Li, Junkun Chen, Aswin Shanmugam Subramanian
Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language.
no code implementations • 12 Jun 2024 • Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei
With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis.
no code implementations • 9 Jun 2024 • Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements.
no code implementations • 8 Jun 2024 • Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time.
no code implementations • 6 Jun 2024 • Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda
We also show that the proposed MaskGIT-based model can generate phoneme durations with higher quality and diversity compared to its regression or flow-matching counterparts.
1 code implementation • 28 May 2024 • Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng
There is rising research interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation.
1 code implementation • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng
In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
no code implementations • 4 Apr 2024 • Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.
1 code implementation • 31 Mar 2024 • Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach.
no code implementations • 5 Mar 2024 • Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.
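Point 1) can be illustrated with a hedged sketch of factorized vector quantization: one frame's latent vector is split into attribute subspaces, and each subspace is quantized against its own codebook. The dimensions, the four-way split, and the nearest-neighbor rule below are illustrative assumptions.

```python
import numpy as np

# Hedged FVQ sketch: quantize each attribute subspace of a frame's latent
# against its own codebook, yielding independent discrete codes per attribute.
rng = np.random.default_rng(0)
subspaces = ["content", "prosody", "timbre", "details"]
dim, codebook_size = 64, 256  # illustrative sizes
codebooks = {name: rng.standard_normal((codebook_size, dim)) for name in subspaces}

def fvq_encode(latent):  # latent: (4 * 64,) vector for one frame
    codes = {}
    for i, name in enumerate(subspaces):
        chunk = latent[i * dim:(i + 1) * dim]
        dists = np.linalg.norm(codebooks[name] - chunk, axis=1)  # nearest entry
        codes[name] = int(np.argmin(dists))
    return codes

print(fvq_encode(rng.standard_normal(4 * dim)))
```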
no code implementations • 30 Dec 2023 • Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei
In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.
no code implementations • 3 Nov 2023 • Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li
We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM.
Automatic Speech Recognition (ASR) +8
1 code implementation • 23 Oct 2023 • Jinyu Li, Xiaokun Pan, Gan Huang, Ziyang Zhang, Nan Wang, Hujun Bao, Guofeng Zhang
In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these two problems.
no code implementations • 23 Oct 2023 • Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur
The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions.
Automatic Speech Recognition (ASR) +4
no code implementations • 16 Oct 2023 • Jinyu Li
Due to their powerful edge-preserving ability and low computational complexity, the guided image filter (GIF) and its improved versions have been widely applied in computer vision and image processing.
no code implementations • 6 Oct 2023 • Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li
Simultaneous Speech-to-Text translation serves a critical role in real-time cross-lingual communication.
no code implementations • 3 Oct 2023 • Yiming Wang, Jinyu Li
In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure.
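One plausible reading of such a weight composition is sketched below: every layer's weight matrix is expressed as a shared base plus a small per-layer low-rank correction, so only the base and the factors need to be stored. This particular composition is an assumption chosen for illustration, not necessarily the paper's exact structure.

```python
import torch

# Hedged sketch: represent each encoder layer's weight as a shared base matrix
# plus a per-layer low-rank delta; full matrices are reconstructed on the fly.
d_model, n_layers, rank = 512, 12, 8  # illustrative sizes
base = torch.nn.Parameter(torch.randn(d_model, d_model) * 0.02)
u = torch.nn.Parameter(torch.randn(n_layers, d_model, rank) * 0.02)
v = torch.nn.Parameter(torch.randn(n_layers, rank, d_model) * 0.02)

def layer_weight(layer_idx):
    return base + u[layer_idx] @ v[layer_idx]

print(layer_weight(0).shape)  # torch.Size([512, 512])
full = n_layers * d_model * d_model
shared = d_model * d_model + n_layers * 2 * d_model * rank
print(f"stored parameters: {shared / full:.1%} of the unshared model")  # ~11.5%
```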
no code implementations • 15 Sep 2023 • Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li
Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +4
1 code implementation • 14 Sep 2023 • Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion.
no code implementations • 14 Aug 2023 • Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech.
no code implementations • 30 Jul 2023 • Eric Sun, Jinyu Li, Jian Xue, Yifan Gong
When mixing 20,000 hours of augmented speech data generated by our method with 12,500 hours of original transcribed speech data for Italian Transformer transducer model pre-training, we achieve an 8.7% relative word error rate reduction.
no code implementations • 8 Jul 2023 • Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language.
no code implementations • 7 Jul 2023 • Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary.
Automatic Speech Recognition (ASR) +2
no code implementations • 28 Jun 2023 • Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM).
no code implementations • 28 Jun 2023 • Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embeddings at a high frame rate.
Automatic Speech Recognition (ASR) +1
no code implementations • 31 May 2023 • Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu
In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.
Automatic Speech Recognition (ASR) +2
no code implementations • 25 May 2023 • Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities.
1 code implementation • CVPR 2023 • Jinyu Li, Chenxu Luo, Xiaodong Yang
In order to deal with the sparse and unstructured raw point clouds, LiDAR based 3D object detection research mostly focuses on designing dedicated local point aggregators for fine-grained geometrical modeling.
Ranked #1 on 3D Object Detection on waymo vehicle
1 code implementation • 7 Mar 2023 • Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
no code implementations • 1 Mar 2023 • Eric Sun, Jinyu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin, Yifan Gong
We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference.
no code implementations • 22 Feb 2023 • Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Sheng Zhao
To address these limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems, improving the previous CSC model from two perspectives: first, we incorporate acoustic information, via an external attention, as well as text hypotheses into CSC to better distinguish the target phrase from dissimilar or irrelevant phrases.
Automatic Speech Recognition (ASR) +3
no code implementations • 16 Feb 2023 • Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li
Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points.
Automatic Speech Recognition (ASR) +2
7 code implementations • 5 Jan 2023 • Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
In addition, we find VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.
no code implementations • 5 Dec 2022 • Rui Zhao, Jian Xue, Partha Parthasarathy, Veljko Miljanic, Jinyu Li
Neural transducer is now the most popular end-to-end model for speech recognition, due to its naturally streaming ability.
no code implementations • 21 Nov 2022 • Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text.
no code implementations • 17 Nov 2022 • Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian
This motivates us to leverage the factorized neural transducer structure, which contains a real language model as its vocabulary predictor.
Automatic Speech Recognition (ASR) +4
no code implementations • 10 Nov 2022 • Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang
In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion.
no code implementations • 9 Nov 2022 • Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez
Compared with a supervised baseline and a WavLM-based SS model using feature embeddings from the previously released WavLM trained on 94K hours of data, our proposed model obtains 15.9% and 11.2% relative word error rate (WER) reductions, respectively, on a simulated far-field speech mixture test set.
no code implementations • 7 Nov 2022 • Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, Jinyu Li, Yifan Gong
Automatic Speech Recognition (ASR) systems typically yield output in lexical form.
Automatic Speech Recognition (ASR) +2
no code implementations • 5 Nov 2022 • Peidong Wang, Eric Sun, Jian Xue, Yu Wu, Long Zhou, Yashesh Gaur, Shujie Liu, Jinyu Li
In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
Automatic Speech Recognition (ASR) +4
no code implementations • 4 Nov 2022 • Jian Xue, Peidong Wang, Jinyu Li, Eric Sun
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language.
1 code implementation • 31 Oct 2022 • Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei
However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.
no code implementations • 27 Oct 2022 • Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers.
Automatic Speech Recognition (ASR) +3
no code implementations • 16 Oct 2022 • Ruchao Fan, Guoli Ye, Yashesh Gaur, Jinyu Li
As a result, we reduce the WER of a streaming TT from 7.6% to 6.5% on the Librispeech test-other data and the CER from 7.3% to 6.1% on the Aishell test data, respectively.
no code implementations • 16 Oct 2022 • Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li
We examine CTCBERT on IDs from HuBERT Iter1, HuBERT Iter2, and PBERT.
2 code implementations • 7 Oct 2022 • Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, LiRong Dai, Jinyu Li, Furu Wei
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods.
Automatic Speech Recognition (ASR) +2
1 code implementation • 30 Sep 2022 • Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, LiRong Dai, Jinyu Li, Furu Wei
In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.
no code implementations • 12 Sep 2022 • Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka
To combine the best of both technologies, we newly design a t-SOT-based ASR model that generates a serialized multi-talker transcription based on two separated speech signals from VarArray.
Automatic Speech Recognition (ASR) +2
2 code implementations • International Conference on Data Engineering 2022 • Shendi Wang, Haoyang Li, Caleb Chen Cao, Xiao-Hui Li, Ng Ngai Fai, Jianxin Liu, Xun Xue, Hu Song, Jinyu Li, Guangye Gu, Lei Chen
Recently, neural network based models have been widely used for recommender systems (RS).
1 code implementation • 18 Jul 2022 • Xingyuan Yu, Weicai Ye, Xiyue Guo, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang
We propose a dynamic update module based on this representation and develop a dense SLAM system that excels in dynamic scenarios.
no code implementations • 21 Jun 2022 • Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei
Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
Automatic Speech Recognition (ASR) +3
1 code implementation • 12 Jun 2022 • Ziqiang Zhang, Junyi Ao, Long Zhou, Shujie Liu, Furu Wei, Jinyu Li
The YiTrans system is built on large-scale pre-trained encoder-decoder models.
no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition.
no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu
In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning).
1 code implementation • 11 Apr 2022 • Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur
Neural transducers have been widely used in automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +3
1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei
In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text.
Automatic Speech Recognition (ASR) +6
1 code implementation • 30 Mar 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.
Automatic Speech Recognition (ASR) +5
1 code implementation • 2 Mar 2022 • Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Veljko Miljanic, Sheng Zhao, Hosam Khalil
In this work, we introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.
Automatic Speech Recognition (ASR) +2
1 code implementation • 2 Feb 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +1
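The serialization behind t-SOT is simple enough to sketch: recognition tokens from at most two virtual output channels are merged in chronological order, and a special token marks every channel switch. The word-level timestamps and the token name "<cc>" below are illustrative.

```python
# Minimal t-SOT-style serialization sketch: merge two channels of timed words
# into one stream, inserting "<cc>" whenever the active channel changes.
def t_sot_serialize(ch0, ch1):
    words = [(t, 0, w) for w, t in ch0] + [(t, 1, w) for w, t in ch1]
    words.sort()
    out, prev_ch = [], 0
    for _, ch, w in words:
        if ch != prev_ch:
            out.append("<cc>")
            prev_ch = ch
        out.append(w)
    return out

ch0 = [("hello", 0.0), ("how", 0.4), ("are", 0.6), ("you", 0.8)]
ch1 = [("good", 0.5), ("morning", 0.7)]
print(t_sot_serialize(ch0, ch1))
# ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']
```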
no code implementations • 24 Jan 2022 • Liang Lu, Jinyu Li, Yifan Gong
Our experimental results based on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significant degradation of the recognition accuracy.
1 code implementation • 16 Dec 2021 • Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang
Recently, pioneering work found that pre-trained speech models can solve full-stack speech processing tasks, because the model uses its bottom layers to learn speaker-related information and its top layers to encode content-related information.
no code implementations • 10 Dec 2021 • Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng
For untranscribed speech data, the hypothesis from an ASR system must be used as a label.
Automatic Speech Recognition (ASR) +4
no code implementations • 2 Nov 2021 • Jinyu Li
Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +1
no code implementations • 28 Oct 2021 • Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie Liu, Takuya Yoshioka, Jinyu Li, DeLiang Wang
The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference.
Automatic Speech Recognition (ASR) +9
no code implementations • 28 Oct 2021 • Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li
In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting.
no code implementations • 27 Oct 2021 • Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei
Multi-talker conversational speech processing has drawn much interest for various applications such as meeting transcription.
9 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.
Ranked #1 on Speech Recognition on CALLHOME En
6 code implementations • ACL 2022 • Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
Automatic Speech Recognition (ASR) +8
5 code implementations • 12 Oct 2021 • Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu
We integrate the proposed methods into the HuBERT framework.
no code implementations • 11 Oct 2021 • Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu
In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.
Automatic Speech Recognition (ASR) +7
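A hedged sketch of the target-swapping idea: clean and noisy views of the same audio are encoded, and each view's contrastive loss additionally uses the other view's quantized targets, so the representations must agree despite the noise. The shapes, cosine-similarity scoring, and temperature below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged wav2vec-Switch-style sketch: add swapped-target contrastive terms so
# the noisy view must predict the clean view's targets and vice versa.
def contrastive(ctx, pos, negs, temp=0.1):
    # ctx: (D,) context vector; pos: (D,) target; negs: (K, D) distractors
    cands = torch.cat([pos.unsqueeze(0), negs], dim=0)            # (K+1, D)
    logits = F.cosine_similarity(ctx.unsqueeze(0), cands) / temp  # (K+1,)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

D, K = 256, 10
ctx_clean, ctx_noisy = torch.randn(D), torch.randn(D)  # encoder outputs
q_clean, q_noisy = torch.randn(D), torch.randn(D)      # quantized targets
negs = torch.randn(K, D)

loss = (contrastive(ctx_clean, q_clean, negs) + contrastive(ctx_noisy, q_noisy, negs)
        + contrastive(ctx_clean, q_noisy, negs) + contrastive(ctx_noisy, q_clean, negs))
print(loss.item())
```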
no code implementations • 10 Oct 2021 • Guoli Ye, Vadim Mazalov, Jinyu Li, Yifan Gong
Hybrid and end-to-end (E2E) systems have their individual advantages, with different error patterns in the speech recognition results.
no code implementations • 6 Oct 2021 • Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
Automatic Speech Recognition (ASR) +3
1 code implementation • 27 Sep 2021 • Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li
In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance.
Automatic Speech Recognition (ASR) +3
no code implementations • 17 Sep 2021 • Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li
Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions.
no code implementations • 17 Aug 2021 • Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li
We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists.
Automatic Speech Recognition (ASR) +2
no code implementations • 13 Jul 2021 • Long Zhou, Jinyu Li, Eric Sun, Shujie Liu
Particularly, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages.
Automatic Speech Recognition (ASR) +1
no code implementations • 12 Jul 2021 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei
Recently, there has been vast interest in self-supervised learning (SSL), where the model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.
no code implementations • 5 Jul 2021 • Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li
Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +3
no code implementations • 4 Jun 2021 • Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong
In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.
no code implementations • 27 Apr 2021 • Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong
The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.
no code implementations • 5 Apr 2021 • Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong
In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications.
no code implementations • 2 Feb 2021 • Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong
The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method.
Automatic Speech Recognition (ASR) +4
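The ILME-style fusion reduces to a single scoring rule during beam search, sketched below with illustrative notation: the estimated internal LM score is subtracted before the external LM score is added, with both interpolation weights tuned on a development set.

```latex
\hat{y} = \arg\max_{y} \Big[ \log p_{\mathrm{E2E}}(y \mid x)
        - \lambda_{\mathrm{ILM}} \log p_{\mathrm{ILM}}(y)
        + \lambda_{\mathrm{LM}}  \log p_{\mathrm{LM}}(y) \Big]
```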
no code implementations • 26 Nov 2020 • Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong
End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcriptions.
no code implementations • 3 Nov 2020 • Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models.
Automatic Speech Recognition (ASR) +4
no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.
Automatic Speech Recognition (ASR) +4
1 code implementation • 23 Oct 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li
With its strong modeling capacity, which comes from its multi-head and multi-layer structure, the Transformer is a very powerful model for learning sequential representations and has recently been successfully applied to speech separation.
no code implementations • 23 Oct 2020 • Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong
Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of the external language model (LM) fusion.
no code implementations • 22 Oct 2020 • Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li
Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition.
no code implementations • 20 Oct 2020 • Peidong Wang, Zhuo Chen, DeLiang Wang, Jinyu Li, Yifan Gong
We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation.
no code implementations • 7 Sep 2020 • Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie
Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription.
1 code implementation • 14 Aug 2020 • Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski
We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation.
1 code implementation • 13 Aug 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou
Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.
Ranked #1 on Speech Separation on LibriCSS (using extra training data)
no code implementations • 12 Aug 2020 • Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li
Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) systems to transfer knowledge from a source to a target language.
Automatic Speech Recognition (ASR) +2
no code implementations • 30 Jul 2020 • Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong
Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
Automatic Speech Recognition (ASR) +2
no code implementations • JEPTALNRECITAL 2020 • Jinyu Li, Leonardo Lancia
Previous studies have shown that speech production depends on auditory feedback conditions.
1 code implementation • 28 May 2020 • Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu
Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming mode.
Automatic Speech Recognition (ASR) +2
no code implementations • 19 May 2020 • Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong
While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has been proven to be a competitive alternative, especially in the offline condition.
no code implementations • 1 May 2020 • Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong
Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its ability to perform online streaming speech recognition.
Automatic Speech Recognition (ASR) +3
no code implementations • 25 Apr 2020 • Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee
We propose a novel neural label embedding (NLE) scheme for the domain adaptation of a deep neural network (DNN) acoustic model with unpaired data samples from source and target domains.
no code implementations • 10 Apr 2020 • Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong
This leads to an inevitable latency during inference.
no code implementations • 23 Mar 2020 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou
The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode.
Audio and Speech Processing
no code implementations • 17 Mar 2020 • Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong
While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved.
Automatic Speech Recognition (ASR) +1
1 code implementation • 30 Jan 2020 • Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li
In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a \textit{continuous} audio stream that contains multiple utterances that are \emph{partially} overlapped by a varying degree.
Automatic Speech Recognition (ASR) +2
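At inference, CSS is typically run chunk-wise; the hedged sketch below slides a window over the recording, separates each chunk into a fixed number of channels with a placeholder separator, aligns the channel permutation of adjacent chunks on their overlap, and overlap-adds. Window sizes and the correlation-based alignment are illustrative; real systems also crossfade the overlaps.

```python
import numpy as np

# Hedged chunk-wise CSS sketch: fixed 2-channel output per window, with a
# simple correlation test to keep channel assignments consistent over time.
def css_stitch(audio, separate, win=16000, hop=8000):
    out, prev = np.zeros((2, len(audio))), None
    for start in range(0, len(audio), hop):
        est = separate(audio[start:start + win])   # (2, chunk_len)
        if prev is not None:
            ov = min(win - hop, est.shape[1])
            keep = np.sum(prev[:, -ov:] * est[:, :ov])
            swap = np.sum(prev[::-1, -ov:] * est[:, :ov])
            if swap > keep:                        # realign channel permutation
                est = est[::-1]
        out[:, start:start + est.shape[1]] += est  # simplified overlap-add
        prev = est
    return out

sep = lambda x: np.stack([x * 0.5, x * 0.5])       # placeholder separator
print(css_stitch(np.random.randn(40000), sep).shape)  # (2, 40000)
```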
no code implementations • 6 Jan 2020 • Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion.
no code implementations • 6 Jan 2020 • Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong
In this work, we extend the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder guidance.
1 code implementation • 6 Dec 2019 • Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou
Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
Automatic Speech Recognition (ASR) +4
no code implementations • 9 Nov 2019 • Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition.
Automatic Speech Recognition (ASR) +3
1 code implementation • 26 Sep 2019 • Jinyu Li, Rui Zhao, Hu Hu, Yifan Gong
In this paper, we improve the RNN-T training in two aspects.
Automatic Speech Recognition (ASR) +2
no code implementations • 29 Apr 2019 • Zhong Meng, Jinyu Li, Yifan Gong
We propose a novel adversarial speaker adaptation (ASA) scheme, in which adversarial learning is applied to regularize the distribution of deep hidden features in a speaker-dependent (SD) deep neural network (DNN) acoustic model to be close to that of a fixed speaker-independent (SI) DNN acoustic model during adaptation.
Automatic Speech Recognition (ASR) +1
no code implementations • 29 Apr 2019 • Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong
The use of deep networks to extract embeddings for speaker recognition has proven successful.
no code implementations • 28 Apr 2019 • Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong
To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth.
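The conditional rule lends itself to a small sketch: per token, the student distills from the teacher's posteriors only where the teacher's top prediction matches the ground truth, and falls back to the hard label elsewhere. The shapes and token-level gating below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged conditional T/S sketch: gate between distillation (KL to teacher) and
# supervised cross-entropy based on whether the teacher is correct per token.
def conditional_ts_loss(student_logits, teacher_logits, labels):
    # student_logits, teacher_logits: (T, V); labels: (T,)
    teacher_correct = teacher_logits.argmax(dim=-1).eq(labels)
    kl = F.kl_div(F.log_softmax(student_logits, -1),
                  F.softmax(teacher_logits, -1), reduction="none").sum(-1)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return torch.where(teacher_correct, kl, ce).mean()

T, V = 5, 100
print(conditional_ts_loss(torch.randn(T, V), torch.randn(T, V),
                          torch.randint(0, V, (T,))).item())
```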
no code implementations • 28 Apr 2019 • Zhong Meng, Jinyu Li, Yifan Gong
Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +2
no code implementations • 4 Jan 2019 • Ke Li, Jinyu Li, Yong Zhao, Kshitiz Kumar, Yifan Gong
We propose two approaches for speaker adaptation in end-to-end (E2E) automatic speech recognition systems.
Automatic Speech Recognition (ASR) +2
no code implementations • 31 Dec 2018 • Amit Das, Jinyu Li, Guoli Ye, Rui Zhao, Yifan Gong
In particular, we introduce Attention CTC, Self-Attention CTC, Hybrid CTC, and Mixed-unit CTC.
no code implementations • 6 Sep 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang
In this paper, we propose a cycle-consistent speech enhancement (CSE) in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones.
no code implementations • 6 Sep 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang
To achieve better performance on ASR task, senone-aware (SA) AFM is further proposed in which an acoustic model network is jointly trained with the feature-mapping and discriminator networks to optimize the senone classification loss in addition to the AFM losses.
no code implementations • 28 Aug 2018 • Jinyu Li, Changliang Liu, Yifan Gong
In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM.
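A hedged sketch of the layer-trajectory idea: ordinary time-LSTM layers model temporal dynamics, while a separate layer-LSTM scans across the depth axis at each frame over the outputs of all time-LSTM layers; its final state provides the features for senone classification. Sizes and the exact wiring below are illustrative.

```python
import torch

# Hedged ltLSTM sketch: time-LSTMs over frames, plus a layer-LSTM over depth.
T, D, L = 20, 64, 4  # frames, width, depth (illustrative)
x = torch.randn(1, T, D)

time_lstms = torch.nn.ModuleList(
    [torch.nn.LSTM(D, D, batch_first=True) for _ in range(L)])
layer_lstm = torch.nn.LSTM(D, D, batch_first=True)

outs, h = [], x
for lstm in time_lstms:                 # depth-wise stack of time-LSTMs
    h, _ = lstm(h)
    outs.append(h)

depth_seq = torch.stack(outs, dim=2).reshape(T, L, D)  # (frames, layers, D)
g, _ = layer_lstm(depth_seq)            # scan across layers at each frame
senone_features = g[:, -1, :]           # (T, D) layer-trajectory output
print(senone_features.shape)
```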
no code implementations • 25 Apr 2018 • Dong Yu, Jinyu Li
In this paper, we summarize recent progress made in deep-learning-based acoustic models and the motivation and insights behind the surveyed techniques.
no code implementations • 14 Apr 2018 • Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, Yifan Gong
In this study, we develop the keyword spotting (KWS) and acoustic model (AM) components in a far-field speaker system.
no code implementations • 2 Apr 2018 • Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang Juang
We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system.
no code implementations • 2 Apr 2018 • Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang Juang
In this method, a student acoustic model and a condition classifier are jointly optimized to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models, and simultaneously, to min-maximize the condition classification loss.
no code implementations • 15 Mar 2018 • Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework.
no code implementations • 15 Mar 2018 • Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, Yifan Gong
However, the word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model a limited number of words in the output layer and maps all the remaining words into an OOV output node.
no code implementations • 28 Nov 2017 • Jinyu Li, Guoli Ye, Rui Zhao, Jasha Droppo, Yifan Gong
However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model a limited number of words in the output layer and maps all the remaining words into an OOV output node.
no code implementations • 21 Nov 2017 • Zhong Meng, Zhuo Chen, Vadim Mazalov, Jinyu Li, Yifan Gong
Unsupervised domain adaptation of speech signal aims at adapting a well-trained source-domain acoustic model to the unlabeled data from target domain.
Automatic Speech Recognition (ASR) +5
1 code implementation • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer, Jinyu Li, Rui Zhao
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.
no code implementations • 17 Aug 2017 • Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong
High accuracy speech recognition requires a large amount of transcribed data for supervised training.
no code implementations • 21 Jul 2017 • Zhehuai Chen, Jasha Droppo, Jinyu Li, Wayne Xiong
We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion.
Automatic Speech Recognition (ASR) +3
no code implementations • 3 Jan 2017 • Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Yifan Gong
A new type of End-to-End system for text-dependent speaker verification is presented in this paper.
no code implementations • 12 Jan 2016 • Pawel Swietojanski, Jinyu Li, Steve Renals
This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data.
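LHUC itself is compact enough to sketch directly, assuming the common parameterization: each speaker gets a vector of re-scaling parameters per hidden layer, hidden activations are multiplied elementwise by an amplitude such as 2·sigmoid(r), and only r is updated during adaptation while the base network stays frozen. The layer width below is illustrative.

```python
import torch

# Minimal LHUC sketch: speaker-dependent elementwise re-scaling of a hidden
# layer's activations; amplitude 2*sigmoid(r) lies in (0, 2) and is the
# identity at r = 0, so adaptation starts from the unadapted network.
hidden_dim = 512
h = torch.randn(10, hidden_dim)                   # hidden activations (batch)
r = torch.zeros(hidden_dim, requires_grad=True)   # per-speaker LHUC params

def lhuc(h, r):
    return h * (2.0 * torch.sigmoid(r))

print(lhuc(h, r).shape)  # torch.Size([10, 512])
```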