Converting written text in natural language to speech.
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #4 on Speech Synthesis on North American English
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i. e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units.
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
Ranked #2 on Text-To-Speech Synthesis on LJSpeech
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs).
Ranked #1 on Text-To-Speech Synthesis on LJSpeech
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric)
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.