Tacotron is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. The backbone of Tacotron is a seq2seq model with attention. The Figure depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net. At a high-level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.
Source: Tacotron: Towards End-to-End Speech SynthesisPaper | Code | Results | Date | Stars |
---|
Task | Papers | Share |
---|---|---|
Speech Synthesis | 42 | 42.42% |
Text-To-Speech Synthesis | 15 | 15.15% |
Sentence | 6 | 6.06% |
Voice Cloning | 5 | 5.05% |
Voice Conversion | 4 | 4.04% |
Speech Recognition | 4 | 4.04% |
Expressive Speech Synthesis | 3 | 3.03% |
Self-Supervised Learning | 2 | 2.02% |
Speaker Verification | 2 | 2.02% |