Text-to-Speech Models

Deep Voice 3 (DV3) is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:

  • Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.

  • Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.

  • Converter: A fully-convolutional post-processing network, which predicts final vocoder parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.
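The dataflow through the three components can be sketched schematically. The classes below are hypothetical toy stand-ins (no real convolutions or learned attention), meant only to show how text flows through encoder, autoregressive decoder, and non-causal converter:

```python
# Schematic dataflow of the three DV3 components. These are toy
# stand-ins, NOT the paper's convolutional layers.

class Encoder:
    # Maps character ids to (key, value) vectors used by attention.
    def __call__(self, text_ids):
        keys = [[float(c)] for c in text_ids]          # toy "learned" keys
        values = [[float(c) + 0.5] for c in text_ids]  # toy values
        return keys, values

class Decoder:
    # Emits mel frames autoregressively, attending over encoder output.
    def __call__(self, keys, values, n_frames):
        frames = []
        prev = [0.0]  # "go" frame
        for _ in range(n_frames):
            # Uniform attention as a stand-in for the learned
            # multi-hop convolutional attention.
            context = [sum(v[0] for v in values) / len(values)]
            prev = [0.5 * prev[0] + 0.5 * context[0]]  # causal update
            frames.append(prev)
        return frames

class Converter:
    # Non-causal: each output may depend on all decoder states.
    def __call__(self, frames):
        mean = sum(f[0] for f in frames) / len(frames)
        return [[f[0], mean] for f in frames]  # toy vocoder parameters

encoder, decoder, converter = Encoder(), Decoder(), Converter()
keys, values = encoder([1, 2, 3])
mels = decoder(keys, values, n_frames=4)
vocoder_params = converter(mels)
print(len(mels), len(vocoder_params))  # 4 4
```

Note how only the converter reads the full frame sequence at once, mirroring its non-causal access to future context.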

The overall objective is a linear combination of the decoder and converter losses. The authors separate the decoder and converter and apply multi-task training because it makes attention learning easier in practice: the mel-spectrogram prediction loss guides the attention mechanism, since attention receives gradients from mel-spectrogram prediction in addition to vocoder parameter prediction.
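A minimal sketch of this combined objective, assuming a simple L1 loss and illustrative weights (`w_mel`, `w_voc` are hypothetical; the paper's exact loss terms and weighting may differ):

```python
# Multi-task objective: linear combination of decoder (mel) and
# converter (vocoder parameter) losses. Toy L1 loss and weights.

def l1_loss(pred, target):
    # Mean absolute error over a flat list of values.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

mel_pred, mel_true = [0.2, 0.4, 0.6], [0.0, 0.5, 0.5]
voc_pred, voc_true = [1.0, 1.2], [0.9, 1.0]

w_mel, w_voc = 1.0, 1.0  # hypothetical combination weights
loss_decoder = l1_loss(mel_pred, mel_true)    # also trains attention
loss_converter = l1_loss(voc_pred, voc_true)  # trains converter only
total = w_mel * loss_decoder + w_voc * loss_converter
print(round(total, 3))  # 0.283
```

Because the decoder loss is computed before the converter, gradients from mel-spectrogram prediction reach the attention mechanism even if the converter loss is poorly conditioned early in training.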

Source: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
