WaveNet is a generative model for raw audio based on the PixelCNN architecture. To capture the long-range temporal dependencies that raw audio generation requires, it is built from dilated causal convolutions, which exhibit very large receptive fields.
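To make the receptive-field claim concrete, here is a minimal pure-Python sketch of a 1-D dilated causal convolution. The function names, the kernel size of 2, and the doubling dilation schedule are illustrative assumptions, not details taken from this page, and the code ignores the gating, residual connections, and learned weights of the real model.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] depends only on x[t], x[t - d], x[t - 2d], ..."""
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            # weights[-1] multiplies the current sample, earlier taps look back in time
            idx = t - (k - 1 - j) * dilation
            if idx >= 0:  # samples before the start are treated as zeros
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed schedule for illustration: dilation doubles at every layer.
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
field = receptive_field(2, dilations)
```

Under these assumptions, ten layers with kernel size 2 already cover 1024 input samples, whereas ten undilated layers of the same kernel size would cover only 11.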
The joint probability of a waveform $\vec{x} = \{ x_1, \dots, x_T \}$ is factorised as a product of conditional probabilities as follows:
$$p\left(\vec{x}\right) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots ,x_{t-1}\right)$$
Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.
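The factorisation above can be sketched in a few lines of Python. The `toy_conditional` function below is a hypothetical stand-in for the network: it returns a distribution over four quantised sample values given the history, and its probabilities are made up for illustration. The log-likelihood is the sum of the conditional log-probabilities, and generation draws one sample at a time, feeding each draw back in as history.

```python
import math
import random


def toy_conditional(history, num_levels=4):
    """Hypothetical stand-in for the model: p(x_t | x_1, ..., x_{t-1})."""
    if not history:
        return [1.0 / num_levels] * num_levels
    # Made-up rule: favour repeating the previous sample value.
    probs = [0.3 / (num_levels - 1)] * num_levels
    probs[history[-1]] = 0.7
    return probs


def log_likelihood(x, conditional):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(conditional(x[:t])[x[t]]) for t in range(len(x)))


def sample(length, conditional, rng):
    """Ancestral sampling: draw x_t from its conditional, then append it to the history."""
    x = []
    for _ in range(length):
        probs = conditional(x)
        x.append(rng.choices(range(len(probs)), weights=probs)[0])
    return x


waveform = sample(16, toy_conditional, random.Random(0))
score = log_likelihood(waveform, toy_conditional)
```

Because every sample depends on all earlier ones, generation in this sketch is inherently sequential, while the likelihood of a known waveform only needs the conditionals evaluated at the observed values.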
Source: WaveNet: A Generative Model for Raw Audio
Task | Papers | Share |
---|---|---|
Speech Synthesis | 53 | 22.18% |
Decoder | 18 | 7.53% |
Text-To-Speech Synthesis | 16 | 6.69% |
Voice Conversion | 12 | 5.02% |
Audio Generation | 8 | 3.35% |
Speech Enhancement | 6 | 2.51% |
Time Series Analysis | 5 | 2.09% |
Speech Recognition | 5 | 2.09% |
Translation | 5 | 2.09% |

Component | Type |
---|---|
Dilated Causal Convolution | Temporal Convolutions |
Mixture of Logistic Distributions | Output Functions |