Text-to-Speech Models


Introduced by Wang et al. in Tacotron: Towards End-to-End Speech Synthesis

Tacotron is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. Its backbone is a sequence-to-sequence (seq2seq) model with attention. The figure depicts the model, which comprises an encoder, an attention-based decoder, and a post-processing net. At a high level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.
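
To make the data flow concrete, the following is a minimal toy sketch of the encoder, attention-based decoder, and post-processing net pipeline described above. It is not the architecture from the paper: the weights are random and untrained, each stage is a single-layer stand-in, plain dot-product attention is swapped in for illustration, and the decoder runs for a fixed number of steps. All names and sizes (`VOCAB`, `HIDDEN`, `N_MELS`, `N_LINEAR`, `synthesize_spectrogram`) are hypothetical. The final spectrogram-to-waveform step is left out.

```python
import math
import random

random.seed(0)  # deterministic toy weights

VOCAB = "abcdefghijklmnopqrstuvwxyz '.,?!"  # hypothetical character set
HIDDEN = 8     # hypothetical hidden size
N_MELS = 4     # hypothetical size of each decoder output frame
N_LINEAR = 6   # hypothetical size of each post-net output frame


def rand_matrix(rows, cols):
    """Random (untrained) weight matrix stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


EMBED = rand_matrix(len(VOCAB), HIDDEN)   # one embedding row per character
W_ENC = rand_matrix(HIDDEN, HIDDEN)       # stand-in for the encoder
W_DEC = rand_matrix(HIDDEN, 2 * HIDDEN)   # decoder input: [context; previous state]
W_FRAME = rand_matrix(N_MELS, HIDDEN)     # decoder state -> spectrogram frame
W_POST = rand_matrix(N_LINEAR, N_MELS)    # stand-in for the post-processing net


def encode(text):
    """Map characters to ids, embed them, and apply one tanh layer."""
    ids = [VOCAB.index(ch) for ch in text.lower() if ch in VOCAB]
    return [[math.tanh(v) for v in matvec(W_ENC, EMBED[i])] for i in ids]


def decode(memory, n_frames):
    """Emit n_frames frames, attending over the encoder outputs at each step."""
    state = [0.0] * HIDDEN
    frames, alignments = [], []
    for _ in range(n_frames):
        # Dot-product attention: score every encoder output against the state.
        scores = [sum(s * m for s, m in zip(state, mem)) for mem in memory]
        weights = softmax(scores)
        context = [sum(w * mem[d] for w, mem in zip(weights, memory))
                   for d in range(HIDDEN)]
        state = [math.tanh(v) for v in matvec(W_DEC, context + state)]
        frames.append(matvec(W_FRAME, state))
        alignments.append(weights)
    return frames, alignments


def postnet(frames):
    """Project each decoder frame to the final spectrogram dimension."""
    return [matvec(W_POST, frame) for frame in frames]


def synthesize_spectrogram(text, n_frames=5):
    """Characters in, spectrogram frames (and attention weights) out."""
    memory = encode(text)
    frames, alignments = decode(memory, n_frames)
    return postnet(frames), alignments
```

Calling `synthesize_spectrogram("hello world", n_frames=5)` returns five frames of `N_LINEAR` values each, along with one attention distribution over the 11 input characters per frame. The values themselves are meaningless because nothing is trained; the sketch only shows how the three stages connect.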

Source: Tacotron: Towards End-to-End Speech Synthesis

