Text-to-Speech Models


Introduced by Peng et al. in Non-Autoregressive Neural Text-to-Speech

ParaNet is a non-autoregressive, attention-based architecture for text-to-speech. It is fully convolutional and converts text to a mel spectrogram. ParaNet distills attention from an autoregressive text-to-spectrogram model and iteratively refines the alignment between text and spectrogram layer by layer. The autoregressive model is otherwise similar to Deep Voice 3 (DV3) except for changes to the decoder: whereas the decoder of DV3 has multiple attention-based layers, each consisting of a causal convolution block followed by an attention block, the autoregressive model keeps a single attention block, in the first layer of its decoder, which simplifies attention distillation.
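
The two ideas in the description above, layer-by-layer alignment refinement and attention distillation, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (attention, refine_alignment, distillation_loss), the residual update standing in for a convolution block, the random inputs, and the cross-entropy form of the loss are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Scaled dot-product attention; returns (context, alignment)."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    alignment = softmax(scores, axis=-1)  # shape (T_dec, T_enc), rows sum to 1
    return alignment @ values, alignment

def refine_alignment(query, keys, values, num_layers=3):
    """Apply attention once per layer, feeding each context into the next query."""
    alignments = []
    for _ in range(num_layers):
        context, alignment = attention(query, keys, values)
        # Residual update; a stand-in (assumption) for a non-causal conv block.
        query = query + context
        alignments.append(alignment)
    return query, alignments

def distillation_loss(teacher_alignment, student_alignments, eps=1e-8):
    """Mean cross-entropy between the teacher alignment and each student layer."""
    losses = [
        -np.mean(np.sum(teacher_alignment * np.log(a + eps), axis=-1))
        for a in student_alignments
    ]
    return float(np.mean(losses))

# Toy data (hypothetical sizes): 5 encoder steps, 8 decoder steps, 16 channels.
rng = np.random.default_rng(0)
T_enc, T_dec, dim = 5, 8, 16
keys = rng.normal(size=(T_enc, dim))
values = rng.normal(size=(T_enc, dim))
query = rng.normal(size=(T_dec, dim))
# Stand-in for the alignment produced by a trained autoregressive teacher.
teacher = softmax(rng.normal(size=(T_dec, T_enc)))

hidden, alignments = refine_alignment(query, keys, values)
loss = distillation_loss(teacher, alignments)
print(len(alignments), alignments[0].shape, loss)
```

Each layer here produces its own alignment matrix, so the distillation term can supervise every layer against the single teacher alignment. In a real model the queries, keys, and values would come from learned convolutional blocks rather than random arrays.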

Source: Non-Autoregressive Neural Text-to-Speech




Task                      Papers  Share
Test                      2       50.00%
GPR                       1       25.00%
Text-To-Speech Synthesis  1       25.00%