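Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components: a recurrent sequence-to-sequence feature prediction network with attention, which predicts a sequence of mel spectrogram frames from an input character sequence, and a modified WaveNet vocoder, which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.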
In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder, rather than CBHG stacks and GRU recurrent layers. It does not use a “reduction factor”; that is, each decoder step corresponds to a single spectrogram frame. It also replaces additive attention with location-sensitive attention.
Source: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

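
To make the "reduction factor" remark concrete, the following minimal sketch counts how many decoder steps are needed to emit a given number of spectrogram frames. The function name `decoder_steps` and the frame count of 800 are illustrative assumptions, not taken from the paper or any reference implementation; the only modelled fact is that a decoder with reduction factor r emits r frames per step, so Tacotron 2 corresponds to r = 1.

```python
def decoder_steps(num_frames: int, reduction_factor: int = 1) -> int:
    """Return the number of decoder steps needed to emit `num_frames` frames
    when every step predicts `reduction_factor` frames at once.

    Hypothetical helper for illustration only.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be at least 1")
    # Integer ceiling division: a partially filled last step still counts.
    return -(-num_frames // reduction_factor)


frames = 800  # assumed utterance length in mel frames, chosen for illustration

# No reduction factor (Tacotron 2 style): one frame per decoder step.
print(decoder_steps(frames, reduction_factor=1))  # 800

# A reduction factor of 2: two frames per decoder step.
print(decoder_steps(frames, reduction_factor=2))  # 400
```

With r = 1 the decoder runs one step per frame, so the output sequence is longer than with a larger r but each step has a single frame to predict.
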
Task | Papers | Share |
---|---|---|
Text to Speech | 15 | 30.00% |
Speech Synthesis | 14 | 28.00% |
Text-To-Speech Synthesis | 4 | 8.00% |
Decoder | 3 | 6.00% |
Voice Cloning | 2 | 4.00% |
Style Transfer | 2 | 4.00% |
Acoustic Modelling | 1 | 2.00% |
Voice Conversion | 1 | 2.00% |
Transliteration | 1 | 2.00% |