Generative Audio Models


Introduced by Donahue et al. in Adversarial Audio Synthesis

WaveGAN is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms).

The WaveGAN architecture is based on DCGAN. The DCGAN generator uses the transposed convolution operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed convolution operation to widen its receptive field, using longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in the same way, using length-25 filters in one dimension and increasing the stride from 2 to 4. These changes leave WaveGAN with the same number of parameters, numerical operations, and output dimensionality as DCGAN. One additional layer is then appended so the generator can produce longer audio (more output samples). Further changes include:

  1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D).
  2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4).
  3. Removing batch normalization from the generator and discriminator.
  4. Training using the WGAN-GP strategy.
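The generator changes above (1D length-25 filters, stride-4 upsampling, no batch norm, one extra layer) can be sketched in PyTorch. This is a hypothetical illustration, not the authors' code: the latent size, channel widths, and the 16384-sample output length are assumptions, chosen so that five stride-4 layers expand a length-16 feature map to 16384 samples.

```python
import torch
import torch.nn as nn

class WaveGANGenerator(nn.Module):
    """Sketch of a WaveGAN-style generator (illustrative, not official)."""

    def __init__(self, latent_dim=100, model_dim=64):
        super().__init__()
        self.model_dim = model_dim
        # Project the latent vector to a short, wide 1D feature map.
        self.fc = nn.Linear(latent_dim, 16 * 16 * model_dim)

        def up(in_ch, out_ch):
            # Length-25 1D filter with stride 4 (cf. DCGAN's 5x5 / stride 2):
            # each layer quadruples the temporal resolution.
            return nn.ConvTranspose1d(in_ch, out_ch, kernel_size=25,
                                      stride=4, padding=11, output_padding=1)

        # No batch normalization, per change (3) above. The fifth layer is
        # the "additional layer" that extends the output beyond DCGAN's
        # 64x64 = 4096 values to 16384 audio samples.
        self.net = nn.Sequential(
            up(16 * model_dim, 8 * model_dim), nn.ReLU(),
            up(8 * model_dim, 4 * model_dim), nn.ReLU(),
            up(4 * model_dim, 2 * model_dim), nn.ReLU(),
            up(2 * model_dim, model_dim), nn.ReLU(),
            up(model_dim, 1), nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 16 * self.model_dim, 16)
        return self.net(x)

z = torch.randn(2, 100)
audio = WaveGANGenerator()(z)
print(audio.shape)  # -> torch.Size([2, 1, 16384])
```

Each transposed-convolution layer maps an input of length L to (L - 1) * 4 - 2 * 11 + 25 + 1 = 4L, so the five layers take 16 → 64 → 256 → 1024 → 4096 → 16384 samples, about one second of audio at 16 kHz.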
Source: Adversarial Audio Synthesis
