Generative Audio Models

WaveVAE

Introduced by Peng et al. in Non-Autoregressive Neural Text-to-Speech

WaveVAE is a generative audio model that can be used as a vocoder in text-to-speech systems. It is a VAE-based model that can be trained from scratch by jointly optimizing the encoder $q_{\phi}\left(\mathbf{z}|\mathbf{x}, \mathbf{c}\right)$ and decoder $p_{\theta}\left(\mathbf{x}|\mathbf{z}, \mathbf{c}\right)$, where $\mathbf{z}$ denotes the latent variables and $\mathbf{c}$ is the mel-spectrogram conditioner.

The encoder of WaveVAE, $q_{\phi}\left(\mathbf{z}|\mathbf{x}, \mathbf{c}\right)$, is parameterized by a Gaussian autoregressive WaveNet that maps the ground-truth audio $\mathbf{x}$ into a latent representation $\mathbf{z}$ of the same length. The decoder $p_{\theta}\left(\mathbf{x}|\mathbf{z}, \mathbf{c}\right)$ is parameterized by the one-step-ahead predictions of an inverse autoregressive flow (IAF).
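The sketch below is only meant to illustrate the overall encoder/decoder interplay, not the paper's architecture: the `GaussianEncoder` uses a few causal convolutions as a stand-in for the Gaussian autoregressive WaveNet, the `AffineFlowDecoder` is a toy substitute for the IAF, and all layer sizes, names, and the MSE reconstruction term are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution with left padding so each output depends only on past samples."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation,
                         padding=(kernel_size - 1) * dilation)

    def forward(self, x):
        out = super().forward(x)
        # drop the extra right-side outputs introduced by the symmetric padding
        return out[:, :, :x.size(-1)]

class GaussianEncoder(nn.Module):
    """Stand-in for q_phi(z | x, c): per-timestep mean and log-scale, same length as x."""
    def __init__(self, mel_dim=80, hidden=64):
        super().__init__()
        self.audio_in = CausalConv1d(1, hidden, kernel_size=3)
        self.cond_in = nn.Conv1d(mel_dim, hidden, kernel_size=1)
        self.stack = nn.Sequential(
            nn.ReLU(), CausalConv1d(hidden, hidden, 3, dilation=2),
            nn.ReLU(), CausalConv1d(hidden, hidden, 3, dilation=4),
        )
        self.to_mu = nn.Conv1d(hidden, 1, kernel_size=1)
        self.to_logs = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, x, c):
        h = self.stack(self.audio_in(x) + self.cond_in(c))
        return self.to_mu(h), self.to_logs(h)

class AffineFlowDecoder(nn.Module):
    """Toy stand-in for the IAF decoder p_theta(x | z, c): one affine transform of z
    whose shift and log-scale are predicted from z and the conditioner."""
    def __init__(self, mel_dim=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + mel_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),  # predicts shift and log-scale
        )

    def forward(self, z, c):
        shift, log_scale = self.net(torch.cat([z, c], dim=1)).chunk(2, dim=1)
        return z * torch.exp(log_scale) + shift

# Usage: one ELBO-style training step on random tensors.
B, T, mel_dim = 2, 400, 80
x = torch.randn(B, 1, T)        # waveform, frame-aligned with the conditioner
c = torch.randn(B, mel_dim, T)  # upsampled mel-spectrogram conditioner

enc, dec = GaussianEncoder(mel_dim), AffineFlowDecoder(mel_dim)
mu, log_s = enc(x, c)
z = mu + torch.exp(log_s) * torch.randn_like(mu)   # reparameterization trick
x_hat = dec(z, c)

recon = ((x - x_hat) ** 2).mean()                  # Gaussian-likelihood proxy (assumption)
kl = (-log_s + 0.5 * (mu ** 2 + torch.exp(2 * log_s)) - 0.5).mean()  # KL to N(0, 1)
loss = recon + kl                                  # negative ELBO
loss.backward()
```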

The training objective is to maximize the evidence lower bound (ELBO) for the observed $\mathbf{x}$ under the VAE.
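Written out for the conditional setting above, the ELBO takes the standard VAE form; the prior $p(\mathbf{z})$ is assumed here to be a standard Gaussian, and any weighting or annealing of the KL term follows the original paper:

$$
\mathcal{L}(\phi, \theta; \mathbf{x}, \mathbf{c}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}, \mathbf{c})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z}, \mathbf{c})\right] - \mathrm{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x}, \mathbf{c}) \,\|\, p(\mathbf{z})\right)
$$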

Source: Non-Autoregressive Neural Text-to-Speech
