ParaNet is a non-autoregressive, attention-based architecture for text-to-speech. It is fully convolutional and converts text to a mel spectrogram. ParaNet distills the attention from an autoregressive text-to-spectrogram model and iteratively refines the alignment between text and spectrogram layer by layer. The autoregressive model is otherwise similar to Deep Voice 3 (DV3) except for a change to the decoder: whereas the decoder of DV3 has multiple attention-based layers, each consisting of a causal convolution block followed by an attention block, the autoregressive model has a single attention block in its decoder.
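To make the idea of attention distillation more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes dot-product attention and a cross-entropy loss between a teacher alignment and the alignment produced by each student attention layer; the function names, the number of layers, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys):
    # Dot-product attention weights: one distribution over text
    # positions for every spectrogram frame (assumed form).
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1)

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    # Cross-entropy between teacher and student alignments,
    # averaged over frames (an assumed choice of loss).
    ce = -(teacher_attn * np.log(student_attn + eps)).sum(axis=-1)
    return float(ce.mean())

rng = np.random.default_rng(0)
n_frames, n_chars, dim = 6, 4, 8

# Stand-in for encoder outputs on the text side.
keys = rng.normal(size=(n_chars, dim))

# Stand-in for the alignment of the autoregressive teacher.
teacher_attn = attention(rng.normal(size=(n_frames, dim)), keys)

# Stand-in for a student with several attention layers, each
# producing its own text-to-spectrogram alignment.
student_layers = [
    attention(rng.normal(size=(n_frames, dim)), keys) for _ in range(3)
]

# Average the distillation loss over all student attention layers.
loss = float(np.mean(
    [attention_distillation_loss(a, teacher_attn) for a in student_layers]
))
print(f"distillation loss: {loss:.4f}")
```

In this sketch every student layer is pulled toward the same teacher alignment, which is one simple way to picture a layer-by-layer refinement of the alignment; the actual loss and its weighting are described in the source paper.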
Source: Non-Autoregressive Neural Text-to-Speech

Task | Papers | Share |
---|---|---|
Text to Speech | 2 | 40.00% |
GPR | 1 | 20.00% |
point cloud upsampling | 1 | 20.00% |
Text-To-Speech Synthesis | 1 | 20.00% |