Objective quality estimation of a speech sample.
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
Ranked #2 on Text-To-Speech Synthesis on LJSpeech
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric)
In this paper, we use self-supervised pre-trained models for MOS prediction.
In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
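The weighted-sum objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the squared-error generation loss, the non-saturating adversarial term, and the weight `adv_weight` are illustrative choices.

```python
import numpy as np

def acoustic_model_loss(generated, target, disc_score_on_generated, adv_weight=1.0):
    """Weighted sum of a minimum generation loss and an adversarial loss.

    disc_score_on_generated: discriminator's probability (in (0, 1]) that the
    generated speech parameters are natural; values near 1 mean the
    discriminator is deceived.
    """
    # Conventional minimum generation loss (MSE over speech parameters,
    # an illustrative choice of distance).
    generation_loss = np.mean((generated - target) ** 2)
    # Adversarial loss for deceiving the discriminator: -log D(G(x)).
    adversarial_loss = -np.log(disc_score_on_generated + 1e-12)
    return generation_loss + adv_weight * adversarial_loss
```

When the discriminator is fully deceived (score near 1), the adversarial term vanishes and training reduces to the conventional generation loss.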
MMSE approaches utilising the proposed a priori SNR estimator are able to achieve higher enhanced speech quality and intelligibility scores than recent masking- and mapping-based deep learning approaches.
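Once an a priori SNR estimate is available (here assumed to come from some estimator; the values below are made up), a standard MMSE-style gain can be applied to the noisy spectrum. A minimal sketch using the Wiener gain as the illustrative gain function:

```python
import numpy as np

def wiener_gain(xi):
    """Wiener gain from the a priori SNR xi = sigma_speech^2 / sigma_noise^2."""
    return xi / (1.0 + xi)

# Hypothetical per-bin a priori SNR estimates (e.g. from a neural estimator)
xi = np.array([0.1, 1.0, 10.0])
noisy_mag = np.ones(3)  # placeholder noisy spectral magnitudes
enhanced_mag = wiener_gain(xi) * noisy_mag
```

Bins with high estimated a priori SNR are passed nearly unchanged, while low-SNR bins are attenuated, which is what drives the quality and intelligibility gains.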
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
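The forward (noising) process underlying such a diffusion probabilistic model can be sketched in closed form; the linear beta schedule below is illustrative, not DiffWave's exact configuration.

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I),
    where abar_t is the cumulative product of (1 - beta)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
waveform = np.sin(np.linspace(0, 8 * np.pi, 256))  # toy "clean" waveform
betas = np.linspace(1e-4, 0.05, 50)                # illustrative noise schedule
noised = forward_diffusion(waveform, 49, betas, rng)
```

Generation then runs this process in reverse: a network trained to predict the added noise is applied step by step, starting from Gaussian noise, optionally conditioned on a mel-spectrogram.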
The network training is independent of the number and the geometric configuration of the microphones.