Generative Audio Models

WaveVAE

Introduced by Peng et al. in Non-Autoregressive Neural Text-to-Speech

WaveVAE is a generative audio model that can be used as a vocoder in text-to-speech systems. It is a VAE-based model that can be trained from scratch by jointly optimizing the encoder $q_{\phi}\left(\mathbf{z}|\mathbf{x}, \mathbf{c}\right)$ and decoder $p_{\theta}\left(\mathbf{x}|\mathbf{z}, \mathbf{c}\right)$, where $\mathbf{z}$ denotes the latent variables and $\mathbf{c}$ is the mel-spectrogram conditioner.

The encoder of WaveVAE, $q_{\phi}\left(\mathbf{z}|\mathbf{x}, \mathbf{c}\right)$, is parameterized by a Gaussian autoregressive WaveNet that maps the ground-truth audio $\mathbf{x}$ into a latent representation $\mathbf{z}$ of the same length. The decoder $p_{\theta}\left(\mathbf{x}|\mathbf{z}, \mathbf{c}\right)$ is parameterized by the one-step-ahead predictions of an inverse autoregressive flow (IAF).
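The sketch below is only meant to illustrate the overall encoder/decoder interplay, not the paper's architecture: the `GaussianEncoder` uses a few causal convolutions as a stand-in for the Gaussian autoregressive WaveNet, the `AffineFlowDecoder` is a toy substitute for the IAF, and all layer sizes, names, and the MSE reconstruction term are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution with left padding so each output depends only on past samples."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation,
                         padding=(kernel_size - 1) * dilation)

    def forward(self, x):
        out = super().forward(x)
        # drop the extra right-side outputs introduced by the symmetric padding
        return out[:, :, :x.size(-1)]

class GaussianEncoder(nn.Module):
    """Stand-in for q_phi(z | x, c): per-timestep mean and log-scale, same length as x."""
    def __init__(self, mel_dim=80, hidden=64):
        super().__init__()
        self.audio_in = CausalConv1d(1, hidden, kernel_size=3)
        self.cond_in = nn.Conv1d(mel_dim, hidden, kernel_size=1)
        self.stack = nn.Sequential(
            nn.ReLU(), CausalConv1d(hidden, hidden, 3, dilation=2),
            nn.ReLU(), CausalConv1d(hidden, hidden, 3, dilation=4),
        )
        self.to_mu = nn.Conv1d(hidden, 1, kernel_size=1)
        self.to_logs = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, x, c):
        h = self.stack(self.audio_in(x) + self.cond_in(c))
        return self.to_mu(h), self.to_logs(h)

class AffineFlowDecoder(nn.Module):
    """Toy stand-in for the IAF decoder p_theta(x | z, c): one affine transform of z
    whose shift and log-scale are predicted from z and the conditioner."""
    def __init__(self, mel_dim=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + mel_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),  # predicts shift and log-scale
        )

    def forward(self, z, c):
        shift, log_scale = self.net(torch.cat([z, c], dim=1)).chunk(2, dim=1)
        return z * torch.exp(log_scale) + shift

# Usage: one ELBO-style training step on random tensors.
B, T, mel_dim = 2, 400, 80
x = torch.randn(B, 1, T)        # waveform, frame-aligned with the conditioner
c = torch.randn(B, mel_dim, T)  # upsampled mel-spectrogram conditioner

enc, dec = GaussianEncoder(mel_dim), AffineFlowDecoder(mel_dim)
mu, log_s = enc(x, c)
z = mu + torch.exp(log_s) * torch.randn_like(mu)   # reparameterization trick
x_hat = dec(z, c)

recon = ((x - x_hat) ** 2).mean()                  # Gaussian-likelihood proxy (assumption)
kl = (-log_s + 0.5 * (mu ** 2 + torch.exp(2 * log_s)) - 0.5).mean()  # KL to N(0, 1)
loss = recon + kl                                  # negative ELBO
loss.backward()
```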

The training objective is to maximize the evidence lower bound (ELBO) for the observed $\mathbf{x}$ under the VAE.
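Written out for the conditional setting above, the ELBO takes the standard VAE form; the prior $p(\mathbf{z})$ is assumed here to be a standard Gaussian, and any weighting or annealing of the KL term follows the original paper:

$$
\mathcal{L}(\phi, \theta; \mathbf{x}, \mathbf{c}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}, \mathbf{c})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z}, \mathbf{c})\right] - \mathrm{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x}, \mathbf{c}) \,\|\, p(\mathbf{z})\right)
$$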

Source: Non-Autoregressive Neural Text-to-Speech
