Text-to-Speech Models

Tacotron

Introduced by Wang et al. in Tacotron: Towards End-to-End Speech Synthesis

Tacotron is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. Its backbone is a sequence-to-sequence (seq2seq) model with attention. The figure depicts the model, which consists of an encoder, an attention-based decoder, and a post-processing net. At a high level, the encoder reads the input characters, the decoder attends over the encoder outputs to produce spectrogram frames, and those frames are then converted to a waveform.

Source: Tacotron: Towards End-to-End Speech Synthesis
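
The data flow described above (characters, then encoder, then attention-based decoder, then post-processing net, then spectrogram) can be illustrated with a toy NumPy sketch. This is not the architecture from the paper: the layers are single random, untrained projections standing in for the real modules, the attention is a generic additive form, and all sizes and names (VOCAB, HID, N_DEC_BINS, N_OUT_BINS, synthesize_spectrogram, and so on) are hypothetical choices made only to show how the tensor shapes move through the three stages. The final spectrogram-to-waveform step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the source).
VOCAB = 40         # number of distinct character ids
EMB = 16           # character embedding size
HID = 32           # encoder / decoder hidden size
N_DEC_BINS = 80    # bins per frame emitted by the decoder
N_OUT_BINS = 513   # bins per frame after the post-processing net


def init(*shape):
    """Random, untrained weights; a real model would learn these."""
    return rng.normal(scale=0.1, size=shape)


params = {
    "embed": init(VOCAB, EMB),
    "enc": init(EMB, HID),
    "att_q": init(HID, HID),
    "att_k": init(HID, HID),
    "att_v": init(HID),
    "dec": init(2 * HID + N_DEC_BINS, HID),
    "out": init(HID, N_DEC_BINS),
    "post": init(N_DEC_BINS, N_OUT_BINS),
}


def encode(char_ids):
    """Encoder stand-in: (T_in,) character ids -> (T_in, HID) memory."""
    return np.tanh(params["embed"][char_ids] @ params["enc"])


def attend(query, memory):
    """Additive attention: score each encoder step against the decoder state."""
    scores = np.tanh(query @ params["att_q"] + memory @ params["att_k"]) @ params["att_v"]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over input positions
    return weights @ memory, weights  # context vector, alignment


def decode(memory, n_frames):
    """Decoder stand-in: emit one frame per step, feeding back the previous frame."""
    state = np.zeros(HID)
    prev_frame = np.zeros(N_DEC_BINS)
    frames, alignments = [], []
    for _ in range(n_frames):
        context, weights = attend(state, memory)
        state = np.tanh(np.concatenate([state, context, prev_frame]) @ params["dec"])
        prev_frame = state @ params["out"]
        frames.append(prev_frame)
        alignments.append(weights)
    return np.stack(frames), np.stack(alignments)


def postprocess(frames):
    """Post-processing net stand-in: map decoder frames to output frames."""
    return frames @ params["post"]


def synthesize_spectrogram(text, n_frames=20):
    """Characters in, spectrogram frames and attention alignment out."""
    char_ids = np.array([ord(c) % VOCAB for c in text])  # toy character-to-id mapping
    memory = encode(char_ids)
    decoder_frames, alignment = decode(memory, n_frames)
    return postprocess(decoder_frames), alignment


spectrogram, alignment = synthesize_spectrogram("hello world", n_frames=5)
print(spectrogram.shape, alignment.shape)
```

Each row of the returned alignment is a probability distribution over input characters, which is what lets the decoder choose which part of the text to read while it produces each frame. A fixed n_frames is used here for simplicity; deciding when to stop is a separate concern that this sketch does not address.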


Tasks

Task                          Papers   Share
Speech Synthesis              43       38.39%
Text-To-Speech Synthesis      15       13.39%
Decoder                       10       8.93%
Sentence                      6        5.36%
Voice Cloning                 5        4.46%
Voice Conversion              4        3.57%
Speech Recognition            4        3.57%
Expressive Speech Synthesis   3        2.68%
Self-Supervised Learning      2        1.79%
