SpecGAN is a generative adversarial network (GAN) method for spectrogram-based, frequency-domain audio generation. It represents audio as spectrograms that are well suited to GANs designed for image generation and that can be approximately inverted back to audio.
To process audio into suitable spectrograms, the authors perform the short-time Fourier transform with 16 ms windows and an 8 ms stride, resulting in 128 frequency bins linearly spaced from 0 to 8 kHz. They take the magnitude of the resulting spectra and scale the amplitude values logarithmically to better align with human perception. They then normalize each frequency bin to have zero mean and unit variance, clip the spectra to $3$ standard deviations, and rescale to $\left[-1, 1\right]$.
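The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes a 16 kHz sample rate (so a 16 ms window is 256 samples and an 8 ms stride is 128 samples), a Hann window, a small `EPS` floor before the logarithm, and that the 128 bins come from dropping the Nyquist bin of the 129-bin real FFT. The function names are hypothetical, and the per-bin statistics are computed here from a single noise clip purely for demonstration; in practice they would presumably be estimated over the training set.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate (Hz)
WIN = 256            # 16 ms window at 16 kHz
HOP = 128            # 8 ms stride at 16 kHz
EPS = 1e-6           # assumed floor to avoid log(0)


def log_magnitude_spectrogram(audio):
    """Return a (frames, 128) log-magnitude spectrogram of a 1-D signal."""
    n_frames = 1 + (len(audio) - WIN) // HOP
    # Index matrix selecting overlapping frames: shape (n_frames, WIN).
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(WIN)  # Hann window is an assumption
    spectra = np.fft.rfft(frames, axis=1)  # 129 complex bins per frame
    magnitude = np.abs(spectra)[:, :128]   # drop Nyquist bin (assumption)
    return np.log(magnitude + EPS)


def normalize(log_spec, mean, std, clip=3.0):
    """Standardize each frequency bin, clip to +/- `clip` std, rescale to [-1, 1]."""
    z = (log_spec - mean) / std
    z = np.clip(z, -clip, clip)
    return z / clip


# Stand-in input: one second of white noise instead of real audio.
rng = np.random.default_rng(0)
audio = rng.standard_normal(SAMPLE_RATE)

log_spec = log_magnitude_spectrogram(audio)
mean = log_spec.mean(axis=0)  # per-frequency-bin mean
std = log_spec.std(axis=0)    # per-frequency-bin standard deviation
x = normalize(log_spec, mean, std)
```

Both the clipping step and the discarded phase lose information, which is why the resulting representation can only be inverted approximately.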
They then apply the DCGAN approach to the resulting spectra.
Source: Adversarial Audio Synthesis

Component | Type |
---|---|
DCGAN | Generative Models |
Griffin-Lim Algorithm | Phase Reconstruction |
Phase Shuffle | Audio Artifact Removal |
Tanh Activation | Activation Functions |
WGAN-GP Loss | Loss Functions |