no code implementations • 9 Apr 2024 • Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu
Discrete speech tokens have become increasingly popular across multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS), and singing voice synthesis (SVS).
Automatic Speech Recognition (ASR) +2
no code implementations • 25 Jan 2024 • Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt.
no code implementations • 3 Nov 2023 • Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu
Extensive experiments show that our approach outperforms existing methods by considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.
1 code implementation • 14 Sep 2023 • Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen
The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into using discrete tokens for tasks like recognition and translation, as they offer lower storage requirements and great potential for applying natural language processing techniques.
no code implementations • 10 Sep 2023 • Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu
Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency.
no code implementations • 25 Jun 2023 • Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker timbres (i.e., speaker similarity) and eliminate the accents from their first language (i.e., nativeness).
no code implementations • 14 Jun 2023 • Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen
Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition.
Automatic Speech Recognition (ASR) +5
no code implementations • 30 Mar 2023 • Chenpeng Du, Qi Chen, Xie Chen, Kai Yu
Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly.
no code implementations • 17 Nov 2022 • Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the values of the specified emotion and Neutral are set to $\alpha$ and $1-\alpha$, respectively.
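The soft-label weighting described above can be sketched as follows; the emotion inventory and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical emotion set for illustration only.
EMOTIONS = ["Neutral", "Happy", "Sad", "Angry"]

def soft_label(emotion: str, alpha: float) -> np.ndarray:
    """Build the soft guidance label: weight `alpha` on the target
    emotion, `1 - alpha` on Neutral, and zero elsewhere. It reduces
    to a one-hot vector when alpha == 1."""
    w = np.zeros(len(EMOTIONS))
    w[EMOTIONS.index(emotion)] += alpha
    w[EMOTIONS.index("Neutral")] += 1.0 - alpha
    return w

print(soft_label("Happy", 0.8))
```

With `alpha = 0.8` and target "Happy", the label places 0.8 on Happy and 0.2 on Neutral, enabling intensity-controllable guidance between the two.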
no code implementations • 2 Apr 2022 • Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features.
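A minimal sketch of that cascade, with dummy stand-ins for both stages (the function bodies, feature dimension, and hop length are assumptions for illustration, not the paper's models):

```python
import numpy as np

def acoustic_model(phonemes: list[str]) -> np.ndarray:
    # Stand-in AM: predict an 80-dim mel-spectrogram frame per
    # phoneme (a real AM predicts many frames per phoneme).
    return np.random.randn(len(phonemes), 80)

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    # Stand-in vocoder: a real neural vocoder (e.g. HiFi-GAN)
    # upsamples each acoustic frame by the hop length to a waveform.
    return np.zeros(mel.shape[0] * hop_length)

mel = acoustic_model(["HH", "AH", "L", "OW"])  # transcript -> features
wav = vocoder(mel)                             # features -> waveform
```

The cascade's interface is the acoustic feature: the AM and vocoder are trained separately and only agree on that intermediate representation.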
no code implementations • 15 Feb 2022 • Yiwei Guo, Chenpeng Du, Kai Yu
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference.
2 code implementations • 1 Feb 2021 • Chenpeng Du, Kai Yu
Generating natural speech with diverse and smooth prosody patterns is a challenging task.
no code implementations • 4 Nov 2020 • Chenpeng Du, Hao Li, Yizhou Lu, Lan Wang, Yanmin Qian
Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited.
Automatic Speech Recognition (ASR) +4