no code implementations • 11 Nov 2024 • Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue
End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years.
no code implementations • 8 Jun 2024 • Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time.
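The core mechanism behind neural codec language models of this kind is autoregressive prediction of discrete audio codec tokens conditioned on text (plus an acoustic prompt for zero-shot voice cloning). Below is a minimal PyTorch sketch of that conditioning pattern; the class name, dimensions, and vocabulary sizes are illustrative assumptions, not the VALL-E 2 implementation.

```python
# Minimal sketch of a neural codec language model for zero-shot TTS.
# Hypothetical shapes and names; not the actual VALL-E 2 recipe.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_text=256, n_codec=1024, d=512, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d)
        self.codec_emb = nn.Embedding(n_codec, d)
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d, n_codec)

    def forward(self, text_ids, codec_ids):
        # text_ids: (B, T_text) phoneme tokens; codec_ids: (B, T_ac) codec tokens
        memory = self.text_emb(text_ids)            # condition on the text
        x = self.codec_emb(codec_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(codec_ids.size(1))
        h = self.decoder(x, memory, tgt_mask=mask)  # causal over acoustic tokens
        return self.head(h)                         # next-codec-token logits

model = CodecLM()
logits = model(torch.randint(0, 256, (2, 20)), torch.randint(0, 1024, (2, 50)))
print(logits.shape)  # (2, 50, 1024)
```

At inference, codec tokens would be sampled step by step and passed to a codec decoder to reconstruct the waveform.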
1 code implementation • 28 May 2024 • Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng
There is rising research interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation.
no code implementations • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng
In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
no code implementations • 25 Sep 2023 • Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian
In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE).
1 code implementation • 17 Jul 2023 • Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng
Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions.
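A minimal sketch of this encoder-decoder factorization: the encoder plays the acoustic-modeling role over input features, while the decoder plays the language-modeling role over output tokens. Dimensions and vocabulary size below are illustrative assumptions.

```python
# Sketch of the encoder-decoder factorization in attention-based E2E ASR.
import torch
import torch.nn as nn

class Seq2SeqASR(nn.Module):
    def __init__(self, feat_dim=80, vocab=5000, d=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d)            # acoustic front end
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.emb = nn.Embedding(vocab, d)
        dec = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d, vocab)

    def forward(self, feats, prev_tokens):
        # feats: (B, T, 80) log-mel features; prev_tokens: (B, U) history
        memory = self.encoder(self.proj(feats))       # acoustic modeling
        mask = nn.Transformer.generate_square_subsequent_mask(prev_tokens.size(1))
        h = self.decoder(self.emb(prev_tokens), memory, tgt_mask=mask)
        return self.out(h)                            # per-step token logits

asr = Seq2SeqASR()
print(asr(torch.randn(2, 100, 80), torch.randint(0, 5000, (2, 12))).shape)
```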
no code implementations • 30 May 2023 • Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng
State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
Automatic Speech Recognition (ASR) +1
1 code implementation • NeurIPS 2023 • Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang
Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language.
no code implementations • 23 May 2023 • Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, ZiYi Yang, Reid Pryzant, Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng, Xuedong Huang
Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities.
no code implementations • 21 May 2023 • ZiYi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities.
no code implementations • 20 Mar 2023 • Haibin Yu, Yuxuan Hu, Yao Qian, Ma Jin, Linquan Liu, Shujie Liu, Yu Shi, Yanmin Qian, Edward Lin, Michael Zeng
Code-switching speech refers to mixing two or more languages within a single utterance.
Automatic Speech Recognition (ASR) +3
1 code implementation • 15 Mar 2023 • Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng
Automatic target sound extraction (TSE) is a machine learning approach that mimics the human auditory ability to attend to a sound source of interest within a mixture of sources.
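A typical TSE setup, sketched below, conditions a mask estimator on an embedding of the target source (e.g., from an enrollment clip) to extract that source from the mixture. Shapes, the conditioning scheme, and the class name are illustrative assumptions.

```python
# Sketch of enrollment-conditioned masking for target sound extraction.
import torch
import torch.nn as nn

class TSEMasker(nn.Module):
    def __init__(self, n_bins=257, d_emb=128, d=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins + d_emb, d, batch_first=True)
        self.mask = nn.Linear(d, n_bins)

    def forward(self, mixture_spec, target_emb):
        # mixture_spec: (B, T, F); target_emb: (B, d_emb) clue for the source
        clue = target_emb.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mixture_spec, clue], dim=-1))
        return torch.sigmoid(self.mask(h)) * mixture_spec  # masked output

model = TSEMasker()
out = model(torch.rand(2, 100, 257), torch.randn(2, 128))
print(out.shape)  # (2, 100, 257)
```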
no code implementations • 17 May 2022 • Mostafa Karimi, Changliang Liu, Kenichi Kumatani, Yao Qian, Tianyu Wu, Jian Wu
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +4
no code implementations • 3 May 2022 • ZiYi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview.
1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei
In this way, the decoder first learns to reconstruct the original speech information from discrete codes before learning to generate correct text (see the sketch below).
Automatic Speech Recognition (ASR) +5
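A toy sketch of that two-stage idea: the same decoder is first trained to predict discrete speech codes (e.g., clustered acoustic units) from unpaired speech, then switched to a text head for paired data. Names, heads, and the staging are illustrative assumptions, not the paper's exact recipe.

```python
# Shared decoder, two objectives: code reconstruction, then text generation.
# Causal masking is omitted for brevity.
import torch
import torch.nn as nn

d, n_codes, n_text = 256, 500, 5000
dec_layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)   # shared weights
code_head = nn.Linear(d, n_codes)   # stage 1: reconstruct speech codes
text_head = nn.Linear(d, n_text)    # stage 2: generate text
ce = nn.CrossEntropyLoss()

memory = torch.randn(2, 50, d)      # encoder output (speech)
tgt = torch.randn(2, 10, d)         # embedded decoder input
h = decoder(tgt, memory)

# Stage 1: unpaired speech, code-reconstruction objective.
code_labels = torch.randint(0, n_codes, (2, 10))
loss_codes = ce(code_head(h).transpose(1, 2), code_labels)

# Stage 2: paired data, swap the head and train on text labels.
text_labels = torch.randint(0, n_text, (2, 10))
loss_text = ce(text_head(h).transpose(1, 2), text_labels)
print(loss_codes.item(), loss_text.item())
```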
1 code implementation • 16 Dec 2021 • Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang
Recently, pioneering work has found that pre-trained speech models can solve full-stack speech processing tasks, because they use their bottom layers to learn speaker-related information and their top layers to encode content-related information.
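One common way to exploit this layer-wise division of labor, sketched below, is to learn softmax weights over all hidden layers per downstream task, so speaker tasks can favor bottom layers and content tasks top layers. The module and shapes are illustrative assumptions.

```python
# Learnable weighted sum over the hidden layers of a frozen speech model.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_layers))  # one weight per layer

    def forward(self, layer_outputs):
        # layer_outputs: (L, B, T, D) stacked hidden states
        weights = torch.softmax(self.w, dim=0)
        return (weights.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)

hidden = torch.randn(12, 2, 100, 768)   # e.g., 12 transformer layers
pooled = WeightedLayerSum(12)(hidden)
print(pooled.shape)                     # (2, 100, 768)
```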
no code implementations • 28 Oct 2021 • Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie Liu, Takuya Yoshioka, Jinyu Li, DeLiang Wang
The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference.
Automatic Speech Recognition (ASR) +9
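The entry above describes a reconstruction module used only as an auxiliary training signal. A minimal sketch of that pattern follows: the encoder output feeds a reconstruction head toward the clean input during training, and inference uses the representation alone. The architecture and loss weighting are assumptions.

```python
# Auxiliary reconstruction head active only during training.
import torch
import torch.nn as nn

class RobustEncoder(nn.Module):
    def __init__(self, feat=80, d=256):
        super().__init__()
        self.encoder = nn.GRU(feat, d, batch_first=True)
        self.recon_head = nn.Linear(d, feat)   # training-only branch

    def forward(self, noisy_feats, clean_feats=None):
        h, _ = self.encoder(noisy_feats)       # learned representation
        if self.training and clean_feats is not None:
            recon_loss = nn.functional.mse_loss(self.recon_head(h), clean_feats)
            return h, recon_loss               # auxiliary target: clean input
        return h, None                         # inference: no reconstruction

model = RobustEncoder()
h, aux = model(torch.randn(2, 100, 80), torch.randn(2, 100, 80))
print(h.shape, aux.item())
```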
9 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.
Ranked #1 on Speech Recognition on CALLHOME En
no code implementations • 15 Oct 2021 • Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian
Multilingual end-to-end (E2E) models have shown great potential for expanding language coverage in automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +4
6 code implementations • ACL 2022 • Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
Automatic Speech Recognition (ASR) +8
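The unified-modal design in SpeechT5 can be caricatured as modality-specific pre-nets mapping speech and text into one embedding space served by a single shared encoder-decoder backbone. The sketch below is a simplification under assumed dimensions, not the paper's architecture in detail.

```python
# Shared encoder-decoder backbone with modality-specific pre-nets.
import torch
import torch.nn as nn

d = 256
speech_prenet = nn.Linear(80, d)       # log-mel frames -> shared space
text_prenet = nn.Embedding(5000, d)    # text tokens   -> shared space
backbone = nn.Transformer(d, nhead=4, num_encoder_layers=3,
                          num_decoder_layers=3, batch_first=True)

speech = speech_prenet(torch.randn(2, 100, 80))
text = text_prenet(torch.randint(0, 5000, (2, 12)))

# The same backbone handles speech->text (ASR-like) and text->speech
# (TTS-like) directions, depending on which modality feeds each side.
asr_like = backbone(src=speech, tgt=text)
tts_like = backbone(src=text, tgt=speech)
print(asr_like.shape, tts_like.shape)
```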
5 code implementations • 12 Oct 2021 • Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu
We integrate the proposed methods into the HuBERT framework.
no code implementations • 11 Oct 2021 • Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu
In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.
Automatic Speech Recognition (ASR) +7
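The switching idea in wav2vec-Switch can be sketched as follows: the original and a noise-augmented view are encoded in parallel, and each view's contrastive targets are taken from the other view, pushing the representations to be noise-invariant. This is a simplified in-batch contrastive loss, not the full wav2vec 2.0 objective.

```python
# Cross-view target switching with a simple frame-level contrastive loss.
import torch
import torch.nn.functional as F

def contrastive(anchor, positive, temperature=0.1):
    # In-batch contrastive loss over flattened frames (B*T, D).
    a = F.normalize(anchor.flatten(0, 1), dim=-1)
    p = F.normalize(positive.flatten(0, 1), dim=-1)
    logits = a @ p.t() / temperature          # similarity to all candidates
    labels = torch.arange(a.size(0))          # the matching frame is positive
    return F.cross_entropy(logits, labels)

z_clean = torch.randn(2, 50, 256)   # encoder output, original speech
z_noisy = torch.randn(2, 50, 256)   # encoder output, noisy version

# Switch targets across the two views.
loss = contrastive(z_clean, z_noisy) + contrastive(z_noisy, z_clean)
print(loss.item())
```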
no code implementations • 12 Jul 2021 • Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei
Recently, there has been great interest in self-supervised learning (SSL), where the model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.
no code implementations • 11 Feb 2021 • Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao, Michael Zeng
End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module.
Ranked #3 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)
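A minimal sketch of the E2E SLU formulation: intent logits are predicted directly from acoustic features, with no intermediate transcript as in an ASR+NLU cascade. The encoder choice, pooling, and intent count are assumptions (31 matches the Fluent Speech Commands intent set, used here only as an example).

```python
# Direct speech-to-intent classification, skipping the ASR transcript.
import torch
import torch.nn as nn

class E2ESLU(nn.Module):
    def __init__(self, feat=80, d=256, n_intents=31):
        super().__init__()
        self.encoder = nn.GRU(feat, d, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(d, n_intents)

    def forward(self, feats):
        h, _ = self.encoder(feats)             # (B, T, d)
        return self.classifier(h.mean(dim=1))  # pool over time -> intent logits

model = E2ESLU()
print(model(torch.randn(2, 100, 80)).shape)    # (2, 31)
```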
5 code implementations • 19 Jan 2021 • Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.
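The multi-task combination described above can be sketched as a weighted sum of a supervised phonetic CTC loss on labeled data and a contrastive self-supervised loss on unlabeled data. The mixing weight, shapes, and the toy contrastive term below are assumptions, not UniSpeech's exact objective.

```python
# Weighted combination of CTC and contrastive losses (toy example).
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, d, n_phones = 2, 50, 256, 40
ctc = nn.CTCLoss(blank=0)

logits = torch.randn(T, B, n_phones).log_softmax(-1)   # (T, B, C) for CTC
targets = torch.randint(1, n_phones, (B, 10))          # avoid the blank id
in_lens = torch.full((B,), T)
tgt_lens = torch.full((B,), 10)
loss_ctc = ctc(logits, targets, in_lens, tgt_lens)

# Contrastive term between context features and quantized targets (toy).
c = F.normalize(torch.randn(B * T, d), dim=-1)
q = F.normalize(torch.randn(B * T, d), dim=-1)
sims = c @ q.t() / 0.1
loss_con = F.cross_entropy(sims, torch.arange(B * T))

alpha = 0.5                                   # illustrative mixing weight
loss = alpha * loss_ctc + (1 - alpha) * loss_con
print(loss.item())
```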
no code implementations • 27 Nov 2019 • Chee Wee Leong, Katrina Roohr, Vikram Ramanarayanan, Michelle P. Martin-Raugh, Harrison Kell, Rutuja Ubale, Yao Qian, Zydrune Mladineo, Laura McCulla
Supervised systems require human labels for training.
no code implementations • WS 2017 • Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, Yao Qian
We believe this makes for a more interesting shared task while building on the methods and results from the previous two shared tasks.
no code implementations • 1 Nov 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g., speech utterances or handwritten documents.
4 code implementations • 21 Oct 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g., speech utterances or handwritten documents.
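A minimal BLSTM sequence tagger of the kind these two entries describe is sketched below; the vocabulary, embedding size, hidden size, and tag-set size are illustrative assumptions.

```python
# Minimal bidirectional LSTM tagger: one tag prediction per input token.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256, n_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.blstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # fwd + bwd states

    def forward(self, token_ids):
        h, _ = self.blstm(self.emb(token_ids))     # (B, T, 2*hidden)
        return self.out(h)                         # per-token tag logits

tagger = BLSTMTagger()
print(tagger(torch.randint(0, 10000, (2, 20))).shape)  # (2, 20, 45)
```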