Generative Audio Models


Introduced by Oord et al. in WaveNet: A Generative Model for Raw Audio

WaveNet is an autoregressive generative model for raw audio based on the PixelCNN architecture. To capture the long-range temporal dependencies that raw audio generation requires, it is built from stacks of dilated causal convolutions, which exhibit very large receptive fields.
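To make the receptive-field claim concrete, here is a minimal pure-Python sketch of a dilated causal convolution. It is not the paper's implementation: the helper names `causal_dilated_conv` and `receptive_field`, the kernel size of 2, the all-ones weights, and the short doubling dilation schedule are all assumptions chosen for illustration.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ..."""
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            # weights[-1] is applied to the current sample; earlier taps look back.
            idx = t - (k - 1 - i) * dilation
            if idx >= 0:  # samples before t = 0 are treated as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Illustrative stack: kernel size 2, dilation doubling at each layer (assumption).
dilations = [1, 2, 4, 8]

# Push a unit impulse at t = 0 through the stack to see how far it spreads.
signal = [0.0] * 20
signal[0] = 1.0
h = signal
for d in dilations:
    h = causal_dilated_conv(h, [1.0, 1.0], d)

affected = [t for t, v in enumerate(h) if v != 0.0]
print(receptive_field(2, dilations), len(affected))
```

With a kernel size of 2, each layer adds its dilation to the receptive field, so doubling the dilation per layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly.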

The joint probability of a waveform $\vec{x} = \{ x_1, \dots, x_T \}$ is factorised as a product of conditional probabilities as follows:

$$p\left(\vec{x}\right) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots ,x_{t-1}\right)$$

Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.
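The factorisation can be read as a recipe for both scoring and generating a waveform one sample at a time. The sketch below assumes a toy setting: `toy_conditional` is a hypothetical stand-in for the network, and `NUM_LEVELS = 4` quantised amplitude levels is an arbitrary choice, not a value taken from the paper.

```python
import math
import random

NUM_LEVELS = 4  # toy number of quantised amplitude levels (assumption)


def toy_conditional(history):
    """Hypothetical stand-in for the model: returns p(x_t | x_1, ..., x_{t-1})."""
    # Favour repeating the previous sample; uniform when there is no history.
    logits = [0.0] * NUM_LEVELS
    if history:
        logits[history[-1]] = 1.0
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]


def log_likelihood(x):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(toy_conditional(x[:t])[x[t]]) for t in range(len(x)))


def sample(length, rng):
    """Ancestral sampling: draw x_t, append it, then condition the next draw on it."""
    x = []
    for _ in range(length):
        probs = toy_conditional(x)
        x.append(rng.choices(range(NUM_LEVELS), weights=probs)[0])
    return x


waveform = sample(8, random.Random(0))
print(waveform, log_likelihood(waveform))
```

Because every conditional is a normalised distribution, the product defines a valid joint distribution over whole waveforms. The sampling loop is inherently sequential, since each draw needs all the samples generated before it.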

Source: WaveNet: A Generative Model for Raw Audio




| Task | Papers | Share |
|---|---|---|
| Speech Synthesis | 49 | 29.17% |
| Text-To-Speech Synthesis | 15 | 8.93% |
| Voice Conversion | 11 | 6.55% |
| Audio Generation | 7 | 4.17% |
| Speech Enhancement | 6 | 3.57% |
| Time Series | 5 | 2.98% |
| Speech Recognition | 5 | 2.98% |
| Music Generation | 4 | 2.38% |
| Machine Translation | 4 | 2.38% |