no code implementations • 7 Oct 2024 • Zhuo Chen, Yichao Yan, Sehngqi Liu, Yuhao Cheng, Weiming Zhao, Lincheng Li, Mengxiao Bi, Xiaokang Yang
Experiments demonstrate the effectiveness and generalization of our Face Clan for various pre-trained GANs.
no code implementations • 14 Sep 2024 • Zhijun Liu, Shuai Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li
This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation.
no code implementations • 17 Jul 2024 • Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang
Thus, we propose HIMO, a large-scale MoCap dataset of full-body humans interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
no code implementations • 12 Jun 2024 • Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi
With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate under extremely small chunks, with cascading errors eliminated.
no code implementations • 2 Apr 2024 • Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan
Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation.
no code implementations • 27 Sep 2023 • Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Shuai Wang, Jixun Yao, Lei Xie, Mengxiao Bi
Third, the model is unable to effectively address the noise in the unvoiced segments, lowering the sound quality.
no code implementations • 21 Aug 2023 • Heyang Xue, Shuai Guo, Pengcheng Zhu, Mengxiao Bi
Despite imperfect score-matching causing drift in training and sampling distributions of diffusion models, recent advances in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example.
no code implementations • 21 May 2023 • Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Jixun Yao, Shuai Wang, Lei Xie, Mengxiao Bi
Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models with streaming conversion capabilities.
no code implementations • 9 Nov 2022 • Ziqian Ning, Qicong Xie, Pengcheng Zhu, Zhichao Wang, Liumeng Xue, Jixun Yao, Lei Xie, Mengxiao Bi
We further fuse the linguistic and para-linguistic features through an attention mechanism, where speaker-dependent prosody features are adopted as the attention query. These features are produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input.
no code implementations • 24 Nov 2021 • Zhichao Wang, Qicong Xie, Tao Li, Hongqiang Du, Lei Xie, Pengcheng Zhu, Mengxiao Bi
One-shot style transfer is a challenging task, since training on a single utterance makes the model extremely prone to overfitting the training data, causing low speaker similarity and a lack of expressiveness.
no code implementations • 17 Oct 2021 • Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, Mengxiao Bi
In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveforms from lyrics and a musical score.
no code implementations • 26 Feb 2018 • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, Zhijie Yan
The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody.