Search Results for author: Haohan Guo

Found 13 papers, 7 papers with code

PodAgent: A Comprehensive Framework for Podcast Generation

1 code implementation · 1 Mar 2025 · Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee

The key challenges lie in in-depth content generation and appropriate, expressive voice production.

Audio Generation · Speech Synthesis

Audio-FLAN: A Preliminary Release

1 code implementation · 23 Feb 2025 · Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner.

Zero-Shot Learning

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

no code implementations · 25 Aug 2024 · Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

With these improvements, we show significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.

Text to Speech

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

no code implementations · 11 Jun 2024 · Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, YuanJun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction.
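The efficiency cost of multi-codebook prediction can be sketched with toy numbers (the codebook count and frame count below are illustrative assumptions, not Single-Codec's actual configuration): a codec with K codebooks emits K tokens per frame, so an LLM must predict K parallel sequences, while a single-codebook codec emits one.

```python
# Illustrative comparison of token streams per utterance.
# Assumed sizes: 100 frames, 8 codebooks for the multi-codebook codec.
n_frames, n_codebooks = 100, 8

# Multi-codebook codec: each frame yields one token per codebook.
multi_stream = [[(k, f) for k in range(n_codebooks)] for f in range(n_frames)]

# Single-codebook codec: each frame yields exactly one token.
single_stream = list(range(n_frames))

multi_total = sum(len(frame) for frame in multi_stream)
print(multi_total, len(single_stream))
```

The multi-codebook codec here produces eight times as many tokens for the model to predict, which is the bottleneck the abstract refers to.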

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

no code implementations · 12 Feb 2024 · Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.

Decoder · Disentanglement +2

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

1 code implementation · 8 Jan 2024 · Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng

To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition.

Decoder · speech-recognition +1

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

1 code implementation · 31 Aug 2023 · Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng

This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio.

Representation Learning · Speech Representation Learning +4

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

1 code implementation · 27 Oct 2022 · Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng

Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.

Transfer Learning

A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

1 code implementation · 22 Sep 2022 · Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer encodes Mel spectrograms of speech training data, down-sampling them progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions and quantizing each stage with its own VQ codebook.
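The core quantization step can be sketched with a single stage (a minimal numpy sketch under assumed toy sizes; the codebook size and feature dimension are illustrative, not the paper's MSMC-VQ-VAE configuration):

```python
# One vector-quantization stage: map each downsampled frame to its
# nearest codeword, yielding a discrete token sequence plus the
# quantized features fed to the decoder.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 codewords, 4-dim features (toy sizes)
frames = rng.normal(size=(10, 4))    # 10 downsampled analysis frames

# Squared Euclidean distance from every frame to every codeword.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1)       # discrete token per frame
quantized = codebook[indices]        # quantized representation

print(indices.shape, quantized.shape)
```

In the multi-stage, multi-codebook setting this step is repeated at several time resolutions, each stage with its own codebook.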

Triplet

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

1 code implementation · 3 Dec 2020 · Haohan Guo, Heng Lu, Na Hu, Chunlei Zhang, Shan Yang, Lei Xie, Dan Su, Dong Yu

In order to make timbre conversion more stable and controllable, speaker embedding is further decomposed to the weighted sum of a group of trainable vectors representing different timbre clusters.
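The decomposition described above can be sketched as follows (a hedged numpy sketch; the cluster count, dimension, and softmax weighting are illustrative assumptions, not the paper's implementation):

```python
# Speaker embedding expressed as a weighted sum of trainable vectors,
# each representing a timbre cluster. In training, both the cluster
# vectors and the per-speaker weights would be learned.
import numpy as np

rng = np.random.default_rng(1)
n_clusters, dim = 6, 16
cluster_vectors = rng.normal(size=(n_clusters, dim))  # trainable in practice

logits = rng.normal(size=n_clusters)             # per-speaker mixing logits
weights = np.exp(logits) / np.exp(logits).sum()  # normalized, sums to 1
speaker_embedding = weights @ cluster_vectors    # (dim,) timbre embedding

print(speaker_embedding.shape)
```

Constraining the embedding to this convex combination is what makes timbre conversion more stable and controllable: adjusting the weights moves the voice between known timbre clusters.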

Audio Generation · Disentanglement +1

Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS

no code implementations · 9 Apr 2019 · Haohan Guo, Frank K. Soong, Lei He, Lei Xie

End-to-end TTS, which predicts speech directly from a given sequence of graphemes or phonemes, has shown improved performance over conventional TTS.

Sentence

A New GAN-based End-to-End TTS Training Algorithm

no code implementations · 9 Apr 2019 · Haohan Guo, Frank K. Soong, Lei He, Lei Xie

However, the autoregressive module training is affected by the exposure bias, or the mismatch between the different distributions of real and predicted data.
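Exposure bias can be illustrated with a toy autoregressive predictor (an assumed one-line function, not the paper's TTS model): under teacher forcing every input comes from ground truth, but at inference the model consumes its own outputs, so an initial error propagates.

```python
# Toy AR system: the next value is 1.1x the previous one. Using the same
# rule as both "data" and "model" isolates the train/inference mismatch.
def step(prev):
    return 1.1 * prev

truth = [1.0]
for _ in range(4):
    truth.append(step(truth[-1]))        # ground-truth rollout

# Teacher forcing: every input is drawn from the ground-truth sequence.
tf_preds = [step(x) for x in truth[:-1]]

# Free running: inputs are the model's own outputs, seeded with a small error.
fr_preds, x = [], truth[0] + 0.05
for _ in range(4):
    x = step(x)
    fr_preds.append(x)

# The free-running trajectory drifts beyond the initial 0.05 error,
# while teacher forcing tracks the ground truth exactly.
print(abs(fr_preds[-1] - truth[-1]))
```

This mismatch between the distributions seen in training (ground truth) and inference (model predictions) is the exposure bias the abstract describes.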

Generative Adversarial Network · Sentence +1

Feature reinforcement with word embedding and parsing information in neural TTS

no code implementations · 3 Jan 2019 · Huaiping Ming, Lei He, Haohan Guo, Frank K. Soong

In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework.

Sentence · Text to Speech
