Audio Model Blocks


Introduced by Bińkowski et al. in High Fidelity Speech Synthesis with Adversarial Networks

GBlock is a type of residual block used in the GAN-TTS text-to-speech architecture. It is a stack of two residual blocks. Because the generator produces raw audio (e.g. a 2 s training clip corresponds to a sequence of 48000 samples), dilated convolutions are used to ensure that the receptive field of $G$ is large enough to capture long-term dependencies. The four kernel size-3 convolutions in each GBlock have increasing dilation factors: 1, 2, 4, 8. Each convolution is preceded by Conditional Batch Normalisation, conditioned on a linear embedding of the noise term $z \sim N\left(0, \mathbf{I}_{128}\right)$ in the single-speaker case, or of the concatenation of $z$ and a one-hot representation of the speaker ID in the multi-speaker case. Each BatchNorm instance has its own embedding.
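To make the effect of the dilation factors concrete, the short sketch below computes the receptive field of the four stacked convolutions. It uses the standard formula for stride-1 convolutions and ignores the upsampling inside the block; the function name `receptive_field` is illustrative and not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # each layer widens the field by (kernel_size - 1) * dilation samples
        rf += (kernel_size - 1) * d
    return rf


# Four kernel size-3 convolutions with dilations 1, 2, 4, 8 (one GBlock)
gblock_rf = receptive_field(3, [1, 2, 4, 8])
# The same four convolutions without dilation, for comparison
plain_rf = receptive_field(3, [1, 1, 1, 1])

# Sample rate implied by "a 2 s clip corresponds to 48000 samples"
sample_rate = 48000 // 2

print(gblock_rf, plain_rf, sample_rate)  # 31 9 24000
```

With the same number of parameters, the dilated stack covers 31 samples per block instead of 9, and the coverage compounds as GBlocks are stacked.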

A GBlock contains two skip connections. In GAN-TTS, the first performs upsampling if the output frequency is higher than the input frequency, and it also contains a size-1 convolution if the number of output channels differs from the number of input channels.
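The two skip connections can be sketched in NumPy as follows. This is a simplified illustration under several assumptions, not the authors' implementation: Conditional Batch Normalisation is omitted, upsampling is done by nearest-neighbour repetition, a ReLU before each convolution is assumed, and all function and parameter names (`dilated_conv1d`, `gblock`, `skip_w`, `factor`) are hypothetical.

```python
import numpy as np


def dilated_conv1d(x, w, dilation):
    """Zero-padded, length-preserving 1D convolution.

    x: (C_in, T) input, w: (C_out, C_in, K) kernel with odd K.
    """
    c_out, _, k = w.shape
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for j in range(k):
        # tap j reads the padded input shifted by j * dilation samples
        out += w[:, :, j] @ xp[:, j * dilation : j * dilation + t]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def upsample(x, factor):
    # nearest-neighbour upsampling along time (an assumption of this sketch)
    return np.repeat(x, factor, axis=1)


def gblock(x, conv_w, skip_w, factor):
    """Simplified GBlock: two residual blocks, dilations 1, 2, 4, 8.

    conv_w: list of four (C_out, C_in, 3) kernels.
    skip_w: (C_out, C_in) size-1 convolution, or None if channels match.
    """
    # First residual block: main path upsamples, then dilations 1 and 2.
    h = upsample(relu(x), factor)
    h = dilated_conv1d(h, conv_w[0], 1)
    h = dilated_conv1d(relu(h), conv_w[1], 2)
    # First skip connection: upsample, plus size-1 conv if channels change.
    skip = upsample(x, factor)
    if skip_w is not None:
        skip = skip_w @ skip
    x = h + skip
    # Second residual block: dilations 4 and 8 with an identity skip.
    h = dilated_conv1d(relu(x), conv_w[2], 4)
    h = dilated_conv1d(relu(h), conv_w[3], 8)
    return x + h


# Example: 8 input channels, 4 output channels, 2x upsampling
rng = np.random.default_rng(0)
x_in = rng.normal(size=(8, 10))
kernels = [rng.normal(size=(4, 8, 3))] + [rng.normal(size=(4, 4, 3)) for _ in range(3)]
out = gblock(x_in, kernels, rng.normal(size=(4, 8)), factor=2)
print(out.shape)  # (4, 20)
```

Only the first skip connection needs the upsampling and the size-1 convolution, because the second residual block keeps both the time resolution and the channel count unchanged.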

Source: High Fidelity Speech Synthesis with Adversarial Networks



Task Papers Share
Speech Synthesis 2 100.00%