ClariNet

Introduced by Ping et al. in ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

ClariNet is an end-to-end text-to-speech architecture. Unlike earlier TTS systems, which pair a text-to-spectrogram model with a separate waveform synthesizer (vocoder), ClariNet is a fully convolutional text-to-wave architecture that can be trained from scratch. Its WaveNet module is conditioned on the model's hidden states rather than on a mel-spectrogram. The architecture is otherwise based on Deep Voice 3.
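
The conditioning difference described above can be illustrated with a rough shape-level sketch. It is not the authors' implementation: the function name `upsample_frames`, all dimensions, and the use of simple frame repetition in place of learned upsampling are assumptions made for illustration only.

```python
import numpy as np

def upsample_frames(frames: np.ndarray, hop_length: int) -> np.ndarray:
    """Repeat each frame-level vector hop_length times along the time axis.

    Hypothetical stand-in for the learned frame-to-sample upsampling that a
    real model would use; plain repetition keeps the sketch dependency-free.
    """
    return np.repeat(frames, hop_length, axis=0)

# Illustrative sizes only (assumptions, not values taken from the paper).
n_frames, n_mels, hidden_dim, hop_length = 5, 80, 256, 256

rng = np.random.default_rng(0)
mel = rng.standard_normal((n_frames, n_mels))                 # frame-level mel-spectrogram
decoder_hidden = rng.standard_normal((n_frames, hidden_dim))  # frame-level hidden states

# Two-stage pipeline: the vocoder is conditioned on the upsampled mel-spectrogram.
two_stage_cond = upsample_frames(mel, hop_length)

# Text-to-wave: the vocoder is conditioned on upsampled hidden states, so no
# intermediate spectrogram sits between the text model and the waveform model.
text_to_wave_cond = upsample_frames(decoder_hidden, hop_length)

print(two_stage_cond.shape, text_to_wave_cond.shape)
```

In both cases the conditioner has one row per audio sample; what changes is the feature it carries, a fixed spectral representation in the first case and a learned hidden representation in the second.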

Source: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Tasks

Task	Papers	Share
Speech Synthesis	3	30.00%
Domain Adaptation	2	20.00%
Unsupervised Domain Adaptation	2	20.00%
Melody Extraction	1	10.00%
Retrieval	1	10.00%
Text-To-Speech Synthesis	1	10.00%

Components

Component	Type
Bridge-net	Audio Model Blocks
Dense Connections	Feedforward Networks
DV3 Attention Block	Audio Model Blocks
DV3 Convolution Block	Audio Model Blocks
L1 Regularization	Regularization
Leaky ReLU	Activation Functions
Normalizing Flows	Distribution Approximation
ReLU	Activation Functions
Softsign Activation	Activation Functions
WaveNet	Generative Audio Models
Weight Normalization	Normalization

Categories

Text-to-Speech Models

Sequence To Sequence Models