Text-to-Speech Models

FastSpeech 2s

Introduced by Ren et al. in FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2s is a text-to-speech model that abandons mel-spectrograms as an intermediate output entirely and generates the speech waveform directly from text during inference. In other words, there is no cascade of mel-spectrogram generation (acoustic model) followed by waveform generation (vocoder). FastSpeech 2s generates the waveform conditioned on intermediate hidden representations, which makes it more compact at inference time by discarding the mel-spectrogram decoder.

Two main design changes are made to the waveform decoder.

First, since phase information is difficult to predict with a variance predictor, adversarial training is used in the waveform decoder, forcing it to recover the phase information implicitly on its own.
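The adversarial objective follows the least-squares GAN (LSGAN) formulation mentioned later in this section. As a minimal NumPy sketch (function names are illustrative, not from the paper's codebase), the two losses push the discriminator's scores on real audio toward 1 and on generated audio toward 0, while the generator is trained to make its outputs score 1:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # LSGAN discriminator loss: real scores pushed toward 1, fake toward 0
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # LSGAN generator loss: scores on generated audio pushed toward 1,
    # which forces the waveform decoder to produce realistic phase
    return np.mean((d_fake - 1.0) ** 2)
```

Here `d_real` and `d_fake` stand in for discriminator outputs on ground-truth and generated waveform slices, respectively.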

Second, the mel-spectrogram decoder of FastSpeech 2 is retained during training: it operates on the full text sequence and helps with text feature extraction. As shown in the Figure, the waveform decoder is based on the structure of WaveNet, including non-causal convolutions and gated activations. The waveform decoder takes a sliced hidden sequence corresponding to a short audio clip as input and upsamples it with transposed 1D convolutions to match the length of the audio clip. The discriminator in the adversarial training adopts the same structure as in Parallel WaveGAN, consisting of ten layers of non-causal dilated 1D convolutions with leaky ReLU activations. The waveform decoder is optimized with the multi-resolution STFT loss and the LSGAN discriminator loss, following Parallel WaveGAN.
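Two of the building blocks above can be sketched in a few lines of NumPy. This is an illustrative, unbatched sketch, not the paper's implementation: `gated_activation` is the WaveNet-style tanh/sigmoid gate, and `upsample_transposed_1d` is a naive single-channel transposed convolution that stretches a hidden sequence to waveform length.

```python
import numpy as np

def gated_activation(x_filter, x_gate):
    # WaveNet-style gated activation: a tanh "filter" branch modulated
    # by a sigmoid "gate" branch (inputs are pre-activation conv outputs)
    return np.tanh(x_filter) * (1.0 / (1.0 + np.exp(-x_gate)))

def upsample_transposed_1d(h, kernel, stride):
    # Naive transposed 1D convolution: each input step scatters a scaled
    # copy of the kernel into the output at intervals of `stride`,
    # so a length-T hidden sequence grows toward audio-sample rate
    out = np.zeros((len(h) - 1) * stride + len(kernel))
    for i, v in enumerate(h):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out
```

For example, a length-4 hidden sequence upsampled with a length-4 kernel and stride 2 yields a length-10 output; stacking several such layers reaches the hop-size ratio between hidden frames and audio samples.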

During inference, the mel-spectrogram decoder is discarded and only the waveform decoder is used to synthesize the speech audio.
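The multi-resolution STFT loss used in training compares magnitude spectrograms of predicted and ground-truth waveforms at several FFT/hop settings. A minimal NumPy sketch, assuming the common spectral-convergence plus log-magnitude formulation from Parallel WaveGAN (the specific resolutions here are illustrative):

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    # magnitude spectrogram via framed FFT with a Hann window
    win = np.hanning(fft_size)
    frames = [x[i:i + fft_size] * win
              for i in range(0, len(x) - fft_size + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=-1)) + 1e-7

def stft_loss(pred, target, fft_size, hop):
    p = stft_mag(pred, fft_size, hop)
    t = stft_mag(target, fft_size, hop)
    sc = np.linalg.norm(t - p) / np.linalg.norm(t)     # spectral convergence
    mag = np.mean(np.abs(np.log(t) - np.log(p)))       # log-magnitude L1
    return sc + mag

def multi_resolution_stft_loss(pred, target,
                               resolutions=((1024, 256), (2048, 512), (512, 128))):
    # average the single-resolution losses over several FFT/hop settings
    return sum(stft_loss(pred, target, n, h) for n, h in resolutions) / len(resolutions)
```

Because phase is discarded by the magnitude STFT, this loss constrains the spectral envelope while the adversarial loss handles phase realism.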

Source: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
