no code implementations • 3 Jul 2023 • Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee
Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency.
1 code implementation • 22 Sep 2022 • Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer encodes the Mel spectrograms of the speech training data into MSMC Representations (MSMCRs) with different time resolutions, by progressively down-sampling them in multiple stages and quantizing each stage with its own VQ codebooks.
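The multi-stage idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's model: frame averaging stands in for the learned down-sampling layers, random vectors stand in for trained codebooks, and each stage uses a single codebook rather than multiple heads.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to its nearest codebook vector (Euclidean distance)."""
    # distances: (num_frames, codebook_size)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def msmc_encode(mel, codebooks, strides):
    """Encode a Mel spectrogram into multi-stage representations: each stage
    down-samples in time by its stride, then vector-quantizes the result
    with that stage's codebook, yielding coarser and coarser sequences."""
    reps = []
    x = mel
    for cb, s in zip(codebooks, strides):
        # crude down-sampling by frame averaging (stand-in for learned layers)
        T = (len(x) // s) * s
        x = x[:T].reshape(-1, s, x.shape[-1]).mean(axis=1)
        q, idx = quantize(x, cb)
        reps.append((q, idx))
        x = q  # feed the quantized output to the next, coarser stage
    return reps

rng = np.random.default_rng(0)
mel = rng.normal(size=(64, 8))          # 64 frames, 8-dim "Mel" features
codebooks = [rng.normal(size=(16, 8)), rng.normal(size=(16, 8))]
reps = msmc_encode(mel, codebooks, strides=[2, 2])
print([r[0].shape for r in reps])        # [(32, 8), (16, 8)]
```

Each stage halves the time resolution here, so a 64-frame input yields quantized sequences of 32 and 16 frames with their codebook indices.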
no code implementations • 14 Sep 2022 • Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie
To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training.
1 code implementation • 19 Oct 2021 • Mutian He, Jingzhou Yang, Lei He, Frank K. Soong
End-to-end TTS requires a large amount of speech/text paired data to cover all necessary knowledge, particularly how to pronounce different words in diverse contexts, so that a neural model may learn such knowledge accordingly.
no code implementations • 8 Jun 2021 • Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He
Experimental results obtained with the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody and improves the final prosody of the TTS speech.
2 code implementations • 5 Mar 2021 • Mutian He, Jingzhou Yang, Lei He, Frank K. Soong
To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts.
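The appeal of byte inputs is that any script collapses to a fixed 256-symbol vocabulary. A minimal sketch of the input side (the spectrogram decoder is omitted):

```python
def text_to_bytes(text):
    """Encode text in any script as a UTF-8 byte sequence; the model's
    input vocabulary is fixed at 256 symbols regardless of language."""
    return list(text.encode("utf-8"))

print(text_to_bytes("abc"))   # [97, 98, 99]
print(text_to_bytes("你好"))  # [228, 189, 160, 229, 165, 189]
```

Note that multi-byte scripts produce longer input sequences per character than Latin text, which the model must absorb.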
1 code implementation • 18 Jul 2019 • Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jian-Hua Tao
Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to a significant improvement in both robustness and overall naturalness, outperforming the baseline (a revised version of Tacotron2) with a MOS gap of 0.14 on a challenging test set and achieving close-to-human quality (4.42 vs. 4.49 in MOS) on a general test set.
no code implementations • 9 Apr 2019 • Haohan Guo, Frank K. Soong, Lei He, Lei Xie
End-to-end TTS, which can predict speech directly from a given sequence of graphemes or phonemes, has shown improved performance over conventional TTS.
no code implementations • 9 Apr 2019 • Haohan Guo, Frank K. Soong, Lei He, Lei Xie
However, training of the autoregressive module suffers from exposure bias, i.e., the mismatch between the distributions of real and predicted data.
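The mismatch can be made concrete with a toy decoding loop. This sketch is not the paper's remedy; it only contrasts teacher forcing (training-time inputs) with free-running decoding (inference-time inputs), where prediction errors feed back into the model. The `step_fn` here is a hypothetical stand-in for one decoder step.

```python
import random

def decode(step_fn, targets, teacher_forcing_ratio):
    """Autoregressive decoding: at each step, feed either the ground-truth
    previous output (teacher forcing) or the model's own previous prediction.
    teacher_forcing_ratio=1.0 reproduces standard training; 0.0 matches
    inference, where errors can accumulate -- the exposure-bias setting."""
    prev, outputs = 0.0, []
    for t, y in enumerate(targets):
        pred = step_fn(prev, t)
        outputs.append(pred)
        use_truth = random.random() < teacher_forcing_ratio
        prev = y if use_truth else pred
    return outputs

step = lambda prev, t: prev + 1.0  # dummy "model": echoes its input plus one
print(decode(step, [10.0, 20.0, 30.0], teacher_forcing_ratio=1.0))  # [1.0, 11.0, 21.0]
print(decode(step, [10.0, 20.0, 30.0], teacher_forcing_ratio=0.0))  # [1.0, 2.0, 3.0]
```

The same model produces very different trajectories depending on whose outputs it conditions on, which is exactly the train/inference mismatch the abstract refers to.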
no code implementations • 3 Jan 2019 • Huaiping Ming, Lei He, Haohan Guo, Frank K. Soong
In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework.
no code implementations • 1 Nov 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g., speech utterances or handwritten documents.
4 code implementations • 21 Oct 2015 • Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g., speech utterances or handwritten documents.
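The bidirectional part of such a tagger can be sketched in NumPy. This is an illustration only: plain tanh-RNN cells with random, untrained weights stand in for LSTM cells, and the point is just how forward and backward hidden states are concatenated so every tag decision sees both left and right context.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_pass(xs, Wx, Wh):
    """Run a simple tanh RNN over a sequence, returning all hidden states."""
    h, out = np.zeros(Wh.shape[0]), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

def bidirectional_tag(xs, params):
    """Bidirectional tagging: concatenate forward and backward hidden states
    at each step, then project to per-tag scores and take the argmax."""
    Wx_f, Wh_f, Wx_b, Wh_b, Wo = params
    fwd = rnn_pass(xs, Wx_f, Wh_f)
    bwd = rnn_pass(xs[::-1], Wx_b, Wh_b)[::-1]  # reverse pass, realigned
    return [int((Wo @ np.concatenate([hf, hb])).argmax())
            for hf, hb in zip(fwd, bwd)]

D, H, NUM_TAGS = 4, 8, 5   # input dim, hidden dim, tag-set size (toy values)
params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)),
          rng.normal(size=(H, D)), rng.normal(size=(H, H)),
          rng.normal(size=(NUM_TAGS, 2 * H)))
seq = [rng.normal(size=D) for _ in range(6)]
tags = bidirectional_tag(seq, params)
print(len(tags))  # 6
```

A trained BLSTM replaces the tanh cells with gated LSTM cells and learns the weights, but the one-tag-per-input-step structure is the same.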