Audio Model Blocks

DV3 Convolution Block

Introduced by Ping et al. in Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

The DV3 Convolution Block is the convolutional block used in the Deep Voice 3 text-to-speech architecture. It consists of a 1-D convolution with a gated linear unit and a residual connection. In the figure, $c$ denotes the dimensionality of the input. The convolution output of size $2 \cdot c$ is split into two equal-sized portions: the gate vector and the input vector. The sum of the residual input and the gated output is multiplied by a scaling factor of $\sqrt{0.5}$ to preserve the input variance early in training. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. To introduce speaker-dependent control, a speaker-dependent embedding is passed through a softsign function and then added as a bias to the convolution filter output. The authors use the softsign nonlinearity because it limits the range of the output while avoiding the saturation problem that exponential-based nonlinearities sometimes exhibit. Convolution filter weights are initialized with zero-mean and unit-variance activations throughout the entire network.
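The data flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`conv1d_same`, `dv3_conv_block`), the "same" zero padding, the choice of the first half of the convolution output as the gate, and the pre-projected speaker vector `speaker_proj` of length $2 \cdot c$ are all hypothetical choices made here for clarity.

```python
import math

def softsign(v):
    # Bounded nonlinearity: output lies in (-1, 1).
    return v / (1.0 + abs(v))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1d_same(x, weight, bias):
    """1-D convolution over time with zero 'same' padding (odd kernel width).

    x:      T frames, each a list of c_in floats
    weight: nested list indexed as [k][c_in][c_out]
    bias:   list of c_out floats
    """
    k, pad, T = len(weight), len(weight) // 2, len(x)
    out = []
    for t in range(T):
        frame = list(bias)
        for j in range(k):
            s = t + j - pad
            if 0 <= s < T:  # positions outside the sequence act as zeros
                for i, xi in enumerate(x[s]):
                    for o in range(len(frame)):
                        frame[o] += xi * weight[j][i][o]
        out.append(frame)
    return out

def dv3_conv_block(x, weight, bias, speaker_proj=None):
    """Gated convolution block with residual connection and sqrt(0.5) scaling.

    Assumption: the first c output channels form the gate vector and the
    last c form the input vector; the source only says the halves are equal.
    Assumption: speaker_proj is an already-projected speaker embedding of
    length 2c; softsign is applied before it is added as a bias.
    """
    c = len(x[0])
    h = conv1d_same(x, weight, bias)  # T frames of size 2c
    scale = math.sqrt(0.5)
    y = []
    for t, frame in enumerate(h):
        if speaker_proj is not None:
            frame = [v + softsign(s) for v, s in zip(frame, speaker_proj)]
        gate, values = frame[:c], frame[c:]
        gated = [sigmoid(g) * v for g, v in zip(gate, values)]
        # Residual sum, rescaled to keep the variance roughly unchanged.
        y.append([(xi + gi) * scale for xi, gi in zip(x[t], gated)])
    return y

# Usage: T = 3 frames, c = 2 channels, kernel width 3, all-zero filters.
x = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]]
zero_w = [[[0.0] * 4 for _ in range(2)] for _ in range(3)]
zero_b = [0.0] * 4
y = dv3_conv_block(x, zero_w, zero_b)
```

With all-zero filters the gated path contributes nothing, so the block reduces to the input scaled by $\sqrt{0.5}$, which makes the residual path easy to check in isolation.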

Source: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning




Task                             Papers   Share
Speech Synthesis                 4        30.77%
Domain Adaptation                2        15.38%
Unsupervised Domain Adaptation   2        15.38%
Test                             2        15.38%
Melody Extraction                1        7.69%
Retrieval                        1        7.69%
Text-To-Speech Synthesis         1        7.69%