1 code implementation • 1 Mar 2025 • Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee
The key challenges lie in in-depth content generation and appropriate, expressive voice production.
1 code implementation • 23 Feb 2025 • Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner.
no code implementations • 25 Aug 2024 • Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
With these improvements, we achieve significant gains in generation performance and speed compared with our previous work and other state-of-the-art (SOTA) large-scale TTS models.
no code implementations • 11 Jun 2024 • Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, YuanJun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li
The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness, since multiple token sequences must be predicted per utterance.
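To make the bottleneck concrete, a back-of-the-envelope sketch (the frame rate and codebook counts below are illustrative assumptions, not figures from the paper) shows how a flattened multi-codebook stream inflates the number of tokens an LLM must generate autoregressively:

```python
# Illustrative token-count arithmetic; frame rate and codebook counts are
# assumed values, not taken from the paper.
frame_rate_hz = 50   # codec frames per second (assumption)
seconds = 10

for num_codebooks in (1, 4, 8):
    # A flattened stream emits one token per codebook per frame.
    tokens = frame_rate_hz * seconds * num_codebooks
    print(f"{num_codebooks} codebook(s): {tokens} tokens for {seconds}s of speech")
```

Each extra codebook multiplies the sequence length, and with it both the decoding steps and the opportunities for error accumulation, which is the efficiency and robustness cost a single-codebook codec avoids.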
no code implementations • 12 Feb 2024 • Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
1 code implementation • 8 Jan 2024 • Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng
To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition.
1 code implementation • 31 Aug 2023 • Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng
This paper proposes QS-TTS, a novel semi-supervised TTS framework that improves TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) on additional unlabeled speech audio.
1 code implementation • 27 Oct 2022 • Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng
Moreover, we optimize the training strategy by leveraging more audio to better learn MSMCRs for low-resource languages.
1 code implementation • 22 Sep 2022 • Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer encodes the Mel spectrograms of training speech by progressively downsampling them in multiple stages into MSMC Representations (MSMCRs) at different time resolutions, quantizing each stage with its own VQ codebook.
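As a rough illustration of this analyzer, the PyTorch sketch below implements progressive downsampling with per-stage quantization; the module names, strides, dimensions, and codebook sizes are assumptions for illustration, not the paper's configuration:

```python
# Hypothetical sketch of multi-stage multi-codebook (MSMC) quantization.
# Strides, dimensions, and codebook sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantizer with a single codebook."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x):  # x: (batch, time, dim)
        # Squared distance from each frame to every codebook entry.
        dist = (x.pow(2).sum(-1, keepdim=True)
                - 2 * x @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)                      # (batch, time)
        q = self.codebook(idx)                         # (batch, time, dim)
        # Straight-through estimator so gradients reach the encoder.
        return x + (q - x).detach(), idx

class MSMCAnalyzer(nn.Module):
    """Downsamples mel frames stage by stage; each stage has its own codebook."""
    def __init__(self, mel_dim=80, dim=256, strides=(2, 2), num_codes=512):
        super().__init__()
        self.proj = nn.Linear(mel_dim, dim)
        self.downs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=2 * s, stride=s, padding=s // 2)
            for s in strides)
        self.vqs = nn.ModuleList(VectorQuantizer(num_codes, dim) for _ in strides)

    def forward(self, mel):  # mel: (batch, time, mel_dim)
        h = self.proj(mel)
        stages = []
        for down, vq in zip(self.downs, self.vqs):
            # Halve the time resolution, then quantize this stage.
            h = torch.relu(down(h.transpose(1, 2))).transpose(1, 2)
            h, idx = vq(h)                             # one MSMCR per stage
            stages.append((h, idx))
        return stages
```

Each stage thus yields a coarser, quantized view of the same utterance, which is the multi-stage, multi-codebook structure the representation's name refers to.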
1 code implementation • 3 Dec 2020 • Haohan Guo, Heng Lu, Na Hu, Chunlei Zhang, Shan Yang, Lei Xie, Dan Su, Dong Yu
To make timbre conversion more stable and controllable, the speaker embedding is further decomposed into a weighted sum of trainable vectors representing different timbre clusters.
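For intuition, here is a minimal sketch of such a decomposition; the cluster count and the dot-product scoring are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: decompose a speaker embedding into a weighted sum of
# trainable timbre-cluster vectors. Cluster count and scoring are assumptions.
import torch
import torch.nn as nn

class TimbreClusterEmbedding(nn.Module):
    def __init__(self, num_clusters: int = 10, dim: int = 256):
        super().__init__()
        # Trainable vectors, each representing a timbre cluster.
        self.clusters = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, speaker_emb: torch.Tensor) -> torch.Tensor:
        # speaker_emb: (batch, dim). Score similarity to each cluster vector.
        scores = speaker_emb @ self.clusters.t()       # (batch, num_clusters)
        weights = torch.softmax(scores, dim=-1)        # soft cluster assignment
        # Reconstruct the embedding as a weighted sum of cluster vectors,
        # constraining it to the learned timbre space.
        return weights @ self.clusters                 # (batch, dim)
```

Because the output always lies in the span of the learned cluster vectors, conversion stays within plausible timbres, and the weights offer a handle for control.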
no code implementations • 9 Apr 2019 • Haohan Guo, Frank K. Soong, Lei He, Lei Xie
End-to-end TTS, which predicts speech directly from a given sequence of graphemes or phonemes, has shown improved performance over conventional TTS.
no code implementations • 9 Apr 2019 • Haohan Guo, Frank K. Soong, Lei He, Lei Xie
However, training of the autoregressive module suffers from exposure bias, i.e., the mismatch between the distributions of real and predicted data.
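The sketch below illustrates the mismatch itself rather than the paper's remedy: the same autoregressive decoder consumes ground-truth frames during teacher-forced training but its own predictions at inference (all module sizes here are assumptions):

```python
# Hypothetical illustration of exposure bias in an autoregressive decoder.
# Module sizes are assumed values for the sketch.
import torch
import torch.nn as nn

decoder = nn.GRUCell(input_size=80, hidden_size=256)
proj = nn.Linear(256, 80)

def run(frames: torch.Tensor, teacher_forcing: bool) -> torch.Tensor:
    """frames: (T, batch, 80) ground-truth mel frames."""
    h = torch.zeros(frames.size(1), 256)
    prev = torch.zeros(frames.size(1), 80)
    outputs = []
    for t in range(frames.size(0)):
        h = decoder(prev, h)
        pred = proj(h)
        outputs.append(pred)
        # Training feeds the real frame; inference feeds the prediction.
        # The resulting input-distribution mismatch is the exposure bias.
        prev = frames[t] if teacher_forcing else pred.detach()
    return torch.stack(outputs)
```

At inference, small per-step errors are fed back as inputs and can compound, which is precisely the regime the teacher-forced model never saw during training.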
no code implementations • 3 Jan 2019 • Huaiping Ming, Lei He, Haohan Guo, Frank K. Soong
In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework.