ClariNet is an end-to-end text-to-speech (TTS) architecture. Earlier TTS systems pair a text-to-spectrogram model with a separate waveform synthesizer (vocoder). ClariNet instead is a text-to-wave architecture that is fully convolutional and can be trained from scratch. Its WaveNet module is conditioned on hidden states rather than on the mel-spectrogram. In other respects the architecture follows Deep Voice 3.
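The key idea is that the vocoder's conditioning input is a sequence of frame-level hidden states instead of mel-spectrogram frames. Below is a minimal pure-Python sketch of one WaveNet-style gated residual layer with local conditioning. The sketch rests on several assumptions. The names `upsample`, `gated_layer` and `make_params`, the layer sizes, the random weights and the nearest-neighbour upsampling from frame rate to audio rate are all hypothetical illustrations, not ClariNet's actual configuration. The random "hidden" vectors also only stand in for the real network's hidden states. The paper's conditioner, layer stack and output distribution are not shown.

```python
import math
import random


def upsample(hidden, hop):
    """Nearest-neighbour upsampling (an assumption): repeat each frame `hop` times."""
    return [frame for frame in hidden for _ in range(hop)]


def vec_mat(v, m):
    """Row vector v (length n) times matrix m (n x k) -> list of length k."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_params(channels, cond_dim, kernel, seed=0):
    """Hypothetical random weights for one layer (kernel taps for the causal conv)."""
    rng = random.Random(seed)
    return {
        "w_filter": [rand_matrix(channels, channels, rng) for _ in range(kernel)],
        "w_gate": [rand_matrix(channels, channels, rng) for _ in range(kernel)],
        "v_filter": rand_matrix(cond_dim, channels, rng),  # conditioning -> filter
        "v_gate": rand_matrix(cond_dim, channels, rng),    # conditioning -> gate
        "w_res": rand_matrix(channels, channels, rng),     # 1x1 residual projection
    }


def causal_conv_step(x, t, taps, dilation):
    """Dilated causal conv at time t: uses x[t], x[t - d], x[t - 2d], ... only."""
    out = [0.0] * len(taps[0][0])
    for i, w in enumerate(taps):
        src = t - i * dilation
        if src >= 0:
            out = add(out, vec_mat(x[src], w))
    return out


def gated_layer(x, cond, params, dilation):
    """z = tanh(W_f*x + V_f c) * sigmoid(W_g*x + V_g c); returns x + z W_res."""
    out = []
    for t in range(len(x)):
        f = add(causal_conv_step(x, t, params["w_filter"], dilation),
                vec_mat(cond[t], params["v_filter"]))
        g = add(causal_conv_step(x, t, params["w_gate"], dilation),
                vec_mat(cond[t], params["v_gate"]))
        z = [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]
        out.append(add(x[t], vec_mat(z, params["w_res"])))
    return out


# Usage: 5 frames of stand-in hidden states, upsampled 4x to 20 audio-rate steps.
rng = random.Random(1)
hidden = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5)]
cond = upsample(hidden, 4)
x = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(len(cond))]
params = make_params(channels=3, cond_dim=2, kernel=2)
y = gated_layer(x, cond, params, dilation=2)
```

In this sketch, swapping the conditioning source from mel-spectrogram frames to hidden states changes only what is fed in as `cond`. The causal structure of the layer stays the same: each output step sees only the current and past inputs plus its own conditioning vector.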
Source: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Task | Papers | Share of papers |
---|---|---|
Speech Synthesis | 3 | 30.00% |
Domain Adaptation | 2 | 20.00% |
Unsupervised Domain Adaptation | 2 | 20.00% |
Melody Extraction | 1 | 10.00% |
Retrieval | 1 | 10.00% |
Text-To-Speech Synthesis | 1 | 10.00% |