Text-to-Speech Models

ClariNet is an end-to-end text-to-speech architecture. Unlike earlier TTS systems, which pair a text-to-spectrogram model with a separate waveform synthesizer (vocoder), ClariNet is a fully convolutional text-to-wave architecture that can be trained from scratch. Its WaveNet module is conditioned on the hidden states rather than on the mel-spectrogram. The architecture is otherwise based on Deep Voice 3.
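The difference in conditioning can be illustrated with a toy data-flow sketch. This is not the ClariNet implementation: every function, constant, and dimension below (`toy_decoder`, `to_mel`, `upsample`, `toy_vocoder`, `HIDDEN_DIM`, `MEL_DIM`, `HOP`) is a hypothetical stand-in, and the "networks" are random or trivial functions chosen only to show which signal reaches the vocoder in each setup.

```python
import math
import random

HIDDEN_DIM = 8  # hypothetical size of a frame-level hidden state
MEL_DIM = 4     # hypothetical number of mel bands
HOP = 4         # hypothetical number of waveform samples per frame


def toy_decoder(text, rng):
    """Stand-in for the text-to-spectrogram model: one hidden vector per character."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)] for _ in text]


def to_mel(hidden, weights):
    """Project each hidden vector down to a mel frame (a narrower representation)."""
    return [[sum(h * w for h, w in zip(frame, row)) for row in weights]
            for frame in hidden]


def upsample(frames, hop):
    """Repeat each frame `hop` times so there is one conditioner per sample."""
    return [frame for frame in frames for _ in range(hop)]


def toy_vocoder(conditioner):
    """Stand-in for a conditioned WaveNet: emits one sample per conditioning vector."""
    return [math.tanh(sum(c) / len(c)) for c in conditioner]


rng = random.Random(0)
hidden = toy_decoder("hello", rng)
weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)] for _ in range(MEL_DIM)]

# Two-stage pipeline: the vocoder only ever sees the mel-spectrogram.
wave_from_mel = toy_vocoder(upsample(to_mel(hidden, weights), HOP))

# Text-to-wave pipeline: the vocoder is conditioned on the hidden states directly.
wave_from_hidden = toy_vocoder(upsample(hidden, HOP))
```

Both paths produce a waveform of the same length; what changes is the conditioner, which is `MEL_DIM` wide in the first case and `HIDDEN_DIM` wide in the second. In a real system, conditioning on hidden states also lets the vocoder's training signal flow back into the text model, which is what allows the whole stack to be trained jointly.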

Source: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech



Task                            Papers  Share
Speech Synthesis                3       30.00%
Domain Adaptation               2       20.00%
Unsupervised Domain Adaptation  2       20.00%
Melody Extraction               1       10.00%
Retrieval                       1       10.00%
Text-To-Speech Synthesis        1       10.00%