MelGAN

Introduced by Kumar et al. in MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

MelGAN is a non-autoregressive, fully convolutional feed-forward architecture for audio waveform generation, trained in a GAN setup. The generator takes a mel-spectrogram $s$ as input and produces a raw waveform $x$ as output. Because the mel-spectrogram has a 256× lower temporal resolution than the waveform, the authors upsample the input sequence with a stack of transposed convolutional layers, each followed by a stack of residual blocks with dilated convolutions. Unlike traditional GANs, the MelGAN generator does not take a global noise vector as input.
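
The 256× figure can be checked with a little arithmetic. The sketch below assumes the per-layer upsampling factors 8, 8, 2, 2 reported in the paper (they are not listed in the text above), and the helper names `total_upsampling` and `waveform_length` are hypothetical. It only counts samples and does not build the network.

```python
from math import prod

# Assumed per-layer upsampling factors of the transposed convolutions
# (8, 8, 2, 2 as reported in the MelGAN paper; not stated in the text above).
UPSAMPLE_FACTORS = (8, 8, 2, 2)


def total_upsampling(factors=UPSAMPLE_FACTORS):
    """Overall temporal upsampling factor of the stacked layers."""
    return prod(factors)


def waveform_length(n_mel_frames, factors=UPSAMPLE_FACTORS):
    """Number of waveform samples produced for a given number of mel frames."""
    return n_mel_frames * total_upsampling(factors)


if __name__ == "__main__":
    print(total_upsampling())    # 256
    print(waveform_length(100))  # 25600
```

Under these assumptions, 100 mel-spectrogram frames map to 25,600 waveform samples.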

To avoid 'checkerboard artifacts' in the generated audio, MelGAN does not use PhaseShuffle; instead, it chooses the kernel size of each transposed convolution to be a multiple of its stride.
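
One way to see why this choice helps is to count how many kernel taps write to each output sample of a 1-D transposed convolution. Uneven overlap is a commonly cited cause of checkerboard patterns. The following is a minimal counting sketch, not the authors' code; the function names are hypothetical, padding is ignored, and the kernel size 16 with stride 8 is an assumed example of a kernel that is twice the stride.

```python
def overlap_counts(n_in, kernel_size, stride):
    """Count how many (input sample, kernel tap) pairs write to each output
    sample of an unpadded 1-D transposed convolution."""
    n_out = (n_in - 1) * stride + kernel_size
    counts = [0] * n_out
    for i in range(n_in):
        for j in range(kernel_size):
            counts[i * stride + j] += 1
    return counts


def interior_levels(n_in, kernel_size, stride):
    """Distinct overlap counts away from the edges, where fewer inputs overlap."""
    counts = overlap_counts(n_in, kernel_size, stride)
    return sorted(set(counts[kernel_size:-kernel_size]))


if __name__ == "__main__":
    # Kernel size is a multiple of the stride: every interior sample is
    # written the same number of times.
    print(interior_levels(8, 16, 8))  # [2]
    print(interior_levels(8, 4, 2))   # [2]
    # Kernel size is not a multiple of the stride: the overlap alternates.
    print(interior_levels(8, 3, 2))   # [1, 2]
```

When the kernel size is a multiple of the stride, each interior output sample receives the same number of contributions, so no periodic pattern is built into the layer itself.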

MelGAN uses weight normalization as its normalization technique. The discriminator is window-based, similar to a PatchGAN discriminator: it classifies short windows of audio rather than the whole sequence.
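
As a reminder of what weight normalization does, it reparameterizes a weight vector as a direction $v$ and a separate scalar gain $g$, so that $w = g \cdot v / \lVert v \rVert$. The sketch below illustrates this on a flat Python list. The function name `weight_norm` is hypothetical, and real frameworks typically apply the same idea per output channel of a convolution.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v|| for a flat weight vector v and scalar gain g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


if __name__ == "__main__":
    w = weight_norm([3.0, 4.0], 2.0)
    # The direction of v is kept, and the length of w equals the gain g.
    print(w)
    print(math.sqrt(sum(x * x for x in w)))
```

The length of the resulting vector is set by `g` alone, which decouples the scale of the weights from their direction.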

Source: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Tasks

Task	Papers	Share
Speech Synthesis	7	53.85%
BIG-bench Machine Learning	1	7.69%
Music Generation	1	7.69%
Face Swapping	1	7.69%
Spectral Reconstruction	1	7.69%
Speech Enhancement	1	7.69%
Translation	1	7.69%

Components

Component	Type
Average Pooling	Pooling Operations
Convolution	Convolutions
GAN Feature Matching	Regularization
GAN Hinge Loss	Loss Functions
Leaky ReLU	Activation Functions
MelGAN Residual Block	Skip Connection Blocks
Tanh Activation	Activation Functions
Weight Normalization	Normalization
Window-based Discriminator	Discriminators

Categories

Generative Audio Models