Tacotron

Introduced by Wang et al. in Tacotron: Towards End-to-End Speech Synthesis

Tacotron is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. The backbone of Tacotron is a seq2seq model with attention. The Figure depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net. At a high-level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.

Source: Tacotron: Towards End-to-End Speech Synthesis

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Speech Synthesis	42	42.42%
Text-To-Speech Synthesis	15	15.15%
Sentence	6	6.06%
Voice Cloning	5	5.05%
Voice Conversion	4	4.04%
Speech Recognition	4	4.04%
Expressive Speech Synthesis	3	3.03%
Self-Supervised Learning	2	2.02%
Speaker Verification	2	2.02%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
Additive Attention	Attention Mechanisms
CBHG	Speech Synthesis Blocks
Dense Connections	Feedforward Networks
Dropout	Regularization
Griffin-Lim Algorithm	Phase Reconstruction
GRU	Recurrent Neural Networks
ReLU	Activation Functions
Residual GRU	Recurrent Neural Networks

Categories

Add Remove

Text-to-Speech Models

Sequence To Sequence Models