GAN-TTS

Introduced by Bińkowski et al. in High Fidelity Speech Synthesis with Adversarial Networks

GAN-TTS is a generative adversarial network for text-to-speech synthesis. The architecture consists of a conditional feed-forward generator that produces raw speech audio and an ensemble of discriminators that operate on random windows of different sizes. The discriminators assess both the general realism of the audio and how well it corresponds to the utterance that should be pronounced.

The generator consists of several GBlocks, which are residual blocks built from dilated convolutions. GBlocks 3–7 gradually upsample the temporal dimension of the hidden representations by factors of 2, 2, 2, 3, and 5, respectively, while GBlocks 3, 6, and 7 each halve the number of channels. A final convolutional layer with a Tanh activation produces a single-channel audio waveform.
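The upsampling and channel schedule described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes seven GBlocks in total (the minimum the text implies), and the input length of 10 steps and 768 input channels are hypothetical values chosen only for illustration.

```python
from math import prod

# Upsampling factor applied by each of GBlocks 3-7, as stated in the text.
UPSAMPLE = {3: 2, 4: 2, 5: 2, 6: 3, 7: 5}
# GBlocks that halve the number of channels, as stated in the text.
HALVE_CHANNELS = {3, 6, 7}


def trace_shapes(in_len, in_channels, num_blocks=7):
    """Return (block index, temporal length, channels) after each GBlock.

    num_blocks=7 is an assumption; the text only says "several" GBlocks.
    """
    length, channels = in_len, in_channels
    shapes = []
    for block in range(1, num_blocks + 1):
        length *= UPSAMPLE.get(block, 1)  # blocks 1-2 keep the length
        if block in HALVE_CHANNELS:
            channels //= 2
        shapes.append((block, length, channels))
    return shapes


# Overall temporal upsampling: 2 * 2 * 2 * 3 * 5
total_factor = prod(UPSAMPLE.values())

# Hypothetical input: 10 conditioning steps with 768 channels.
for block, length, channels in trace_shapes(10, 768):
    print(f"GBlock {block}: length={length}, channels={channels}")
```

Under these assumptions, each input step expands to 120 output samples, and the channel count falls to one eighth of its initial value before the final single-channel convolution.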

Instead of a single discriminator, GAN-TTS uses an ensemble of Random Window Discriminators (RWDs), which operate on randomly sub-sampled fragments of the real or generated samples. Because the windows differ in size, the ensemble evaluates the audio in several complementary ways.
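The random sub-sampling step can be sketched as follows. This is an illustrative sketch only: the function names, the window sizes, and the silent 4800-sample waveform are all hypothetical, and the discriminator networks themselves are omitted.

```python
import random


def random_window(waveform, window_size, rng=random):
    """Return a contiguous fragment of `window_size` samples at a random offset."""
    if window_size > len(waveform):
        raise ValueError("window is longer than the waveform")
    start = rng.randrange(len(waveform) - window_size + 1)
    return waveform[start:start + window_size]


def ensemble_windows(waveform, window_sizes, rng=random):
    """Draw one independent random window per discriminator in the ensemble."""
    return {size: random_window(waveform, size, rng) for size in window_sizes}


# Hypothetical window sizes and a dummy waveform, for illustration only.
WINDOW_SIZES = (240, 480, 960)
audio = [0.0] * 4800
windows = ensemble_windows(audio, WINDOW_SIZES, random.Random(0))

for size, fragment in windows.items():
    print(f"window size {size}: {len(fragment)} samples")
```

Each discriminator in such an ensemble would score only its own fragment, so no single discriminator needs to process the full waveform.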

Source: High Fidelity Speech Synthesis with Adversarial Networks

Tasks

Task	Papers	Share
Speech Synthesis	2	100.00%

Components

Component	Type
Convolution	Convolutions
GBlock	Audio Model Blocks
Multiple Random Window Discriminator	Discriminators
Off-Diagonal Orthogonal Regularization	Regularization
Orthogonal Regularization	Regularization
Spectral Normalization	Normalization
Tanh Activation	Activation Functions

Categories

Text-to-Speech Models

Sequence To Sequence Models