Deep Voice 3

Introduced by Ping et al. in Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Deep Voice 3 (DV3) is a fully convolutional, attention-based neural text-to-speech system. Its architecture consists of three components:

Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.
Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.
Converter: A fully-convolutional post-processing network, which predicts final vocoder parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.
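The causal versus non-causal distinction between the decoder and the converter comes down to how each convolution is padded. The following is a minimal pure-Python sketch of that idea, not the authors' implementation; the function name `conv1d` and its parameters are hypothetical and chosen only for illustration.

```python
def conv1d(x, kernel, causal):
    """Same-length 1-D convolution (cross-correlation) over a list of floats.

    causal=True  pads only on the left, so output[t] depends on x[0..t] only
                 (the constraint an autoregressive decoder needs).
    causal=False pads on both sides, so output[t] can also see future samples
                 (which a post-processing converter is allowed to do).
    """
    k = len(kernel)
    if causal:
        left, right = k - 1, 0
    else:
        left = (k - 1) // 2
        right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    # Slide the kernel over the padded signal, one output per input step.
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]
```

Feeding in a single impulse at time step 3 with a length-3 kernel of ones, the causal variant produces non-zero outputs only at steps 3 and later, whereas the non-causal variant already responds at step 2, one step before the impulse arrives.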

The overall objective function is a linear combination of the decoder loss and the converter loss. The authors keep the decoder and converter separate and train them jointly as a multi-task problem, because this makes the attention easier to learn in practice. Specifically, the attention mechanism receives gradients from mel-spectrogram prediction in addition to those from vocoder parameter prediction, so the mel-spectrogram loss directly guides attention training.
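The linear combination of the two losses can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact objective: the mean absolute error used for both terms, the weights `w_decoder` and `w_converter`, and all function names are hypothetical placeholders.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(mel_pred, mel_target, voc_pred, voc_target,
               w_decoder=1.0, w_converter=1.0):
    """Weighted sum of the decoder and converter losses.

    The weights are assumed hyperparameters, not values from the source.
    """
    decoder_loss = l1_loss(mel_pred, mel_target)    # mel-spectrogram term
    converter_loss = l1_loss(voc_pred, voc_target)  # vocoder-parameter term
    return w_decoder * decoder_loss + w_converter * converter_loss
```

Because both terms are summed into one scalar, a single backward pass sends gradients from both prediction tasks through the shared attention layers, which is the multi-task effect described above.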

Source: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Tasks

Task	Papers	Share
Text-To-Speech Synthesis	1	50.00%
Speech Synthesis	1	50.00%

Components

Component	Type	Notes
Dense Connections	Feedforward Networks
DV3 Attention Block	Audio Model Blocks
DV3 Convolution Block	Audio Model Blocks
Griffin-Lim Algorithm	Phase Reconstruction	(optional)
L1 Regularization	Regularization
ReLU	Activation Functions
Softsign Activation	Activation Functions
WaveNet	Generative Audio Models	(optional)
Weight Normalization	Normalization

Categories

Text-to-Speech Models

Sequence To Sequence Models