no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
Identifying the emotional state from speech is essential for natural interaction between the machine and the speaker.
no code implementations • 2 Jan 2025 • Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng
The pre-trained BERT model extracts semantic features from a raw Chinese character sequence, and the NN-based classifier predicts the polyphonic character's pronunciation from the BERT output.
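A minimal sketch of this kind of BERT-plus-classifier setup (assuming the Hugging Face `transformers` library, the `bert-base-chinese` checkpoint, and a simple linear head; the paper's actual architecture and pronunciation label set may differ):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PolyphoneClassifier(nn.Module):
    """BERT encoder + linear head that predicts a pronunciation class
    for one polyphonic character in the input sentence (illustrative only)."""
    def __init__(self, num_pronunciations, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.head = nn.Linear(self.bert.config.hidden_size, num_pronunciations)

    def forward(self, input_ids, attention_mask, char_index):
        # Contextual features for every character in the sentence
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Pick the hidden state of the polyphonic character and classify it
        target = hidden[torch.arange(hidden.size(0)), char_index]
        return self.head(target)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = PolyphoneClassifier(num_pronunciations=2)
enc = tokenizer("我还要还钱", return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"],
               char_index=torch.tensor([4]))  # position of the second "还"
```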
no code implementations • 28 Aug 2024 • John Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu
The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model.
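A minimal sketch of how a discriminator can be folded into TTS training (a toy convolutional discriminator over mel-spectrograms and a least-squares GAN loss; the paper's actual discriminator, features, and losses are not specified here):

```python
import torch
import torch.nn as nn

# Toy discriminator over mel-spectrogram frames (illustrative only)
discriminator = nn.Sequential(
    nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
    nn.Conv1d(128, 1, kernel_size=5, padding=2),
)

def adversarial_losses(real_mel, fake_mel):
    """Least-squares GAN losses: the discriminator scores real vs. synthesized
    mels, and the TTS model is pushed to make its outputs score as real."""
    d_real = discriminator(real_mel)
    d_fake = discriminator(fake_mel.detach())
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((discriminator(fake_mel) - 1) ** 2).mean()
    return d_loss, g_loss

real = torch.randn(4, 80, 200)   # (batch, mel bins, frames) from recordings
fake = torch.randn(4, 80, 200)   # would come from the TTS acoustic model
d_loss, g_loss = adversarial_losses(real, fake)
```

In practice the generator loss would be added to the usual reconstruction objective of the acoustic model, so the adversarial signal refines both spectral detail and prosody rather than replacing the base TTS loss.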
1 code implementation • 8 Mar 2024 • Peng Liu, Dongyang Dai, Zhiyong Wu
Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations.
Ranked #5 on Speech Synthesis on LibriTTS
no code implementations • 7 Oct 2021 • Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang
(2) How to clone a person's voice while controlling the style and prosody.
no code implementations • 21 Dec 2020 • Xiong Cai, Zhiyong Wu, Kuo Zhong, Bin Su, Dongyang Dai, Helen Meng
By using deep learning approaches, Speech Emotion Recognition (SER) on a single domain has achieved many excellent results.
no code implementations • 20 Jun 2020 • Huirong Huang, Zhiyong Wu, Shiyin Kang, Dongyang Dai, Jia Jia, Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng
Recent approaches mainly have the following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input.
no code implementations • 26 May 2020 • Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang
First, the speech synthesis model is pre-trained on both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on the noisy, low-resource data of the new speaker; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.
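A minimal sketch of this three-stage recipe (a toy acoustic model with a hypothetical clean/noisy condition embedding; the real system's architecture, data, and conditioning mechanism are not specified here):

```python
import torch
import torch.nn as nn

class ToyConditionalTTS(nn.Module):
    """Toy acoustic model conditioned on a clean/noisy flag (illustrative only)."""
    def __init__(self, text_dim=64, mel_dim=80):
        super().__init__()
        self.cond_emb = nn.Embedding(2, text_dim)   # 0 = clean, 1 = noisy
        self.net = nn.GRU(text_dim, mel_dim, batch_first=True)

    def forward(self, text_feats, condition):
        # Add the condition embedding to every text frame, then decode mels
        x = text_feats + self.cond_emb(condition).unsqueeze(1)
        mel, _ = self.net(x)
        return mel

def train_step(model, optim, text_feats, target_mel, condition):
    optim.zero_grad()
    loss = nn.functional.l1_loss(model(text_feats, condition), target_mel)
    loss.backward()
    optim.step()

model = ToyConditionalTTS()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: pre-train on multi-speaker clean data plus noise-augmented data,
#          each batch labeled with its clean/noisy condition.
# Stage 2: adapt the pre-trained model on the new speaker's noisy data
#          (condition fixed to "noisy").
# Stage 3: at inference, set the condition to "clean" to synthesize
#          the new speaker's clean voice.
with torch.no_grad():
    mel = model(torch.randn(1, 50, 64), torch.tensor([0]))  # clean condition
```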