A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
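The staged structure described above can be sketched as three chained functions. This is a toy illustration only: every function name and body here is a hypothetical placeholder, not a real TTS library API.

```python
# Minimal sketch of the three conventional TTS stages: text analysis
# frontend -> acoustic model -> audio synthesis. All names and bodies
# are illustrative placeholders, not a real implementation.

def text_analysis_frontend(text):
    """Normalize raw text into a linguistic feature sequence (here: characters)."""
    return [c.lower() for c in text if c.isalnum() or c == " "]

def acoustic_model(features):
    """Map linguistic features to acoustic frames (here: toy 4-value frames)."""
    return [[float(ord(c)) if c != " " else 0.0] * 4 for c in features]

def audio_synthesis(frames):
    """Turn acoustic frames into a waveform (here: a flattened sample list)."""
    return [sample for frame in frames for sample in frame]

def tts_pipeline(text):
    return audio_synthesis(acoustic_model(text_analysis_frontend(text)))

wave = tts_pipeline("Hello TTS")
print(len(wave))  # -> 36: each of the 9 retained characters yields 4 samples
```

Real systems replace each toy stage with a learned or rule-based component, but the interface between stages (text features in, acoustic frames out, waveform at the end) follows this shape.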
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without any recurrent units.
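The key property of a convolutional TTS stack is that, unlike an RNN, it carries no sequential hidden state: a causal 1-D convolution sees only a fixed window of past inputs, so every output position can be computed in parallel. A minimal sketch, with arbitrary illustrative kernel values rather than anything from the paper:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # left-pad so no future leaks in
    # All time steps are computed independently -- no recurrent state.
    return np.array([np.dot(padded[t:t + k], kernel) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv1d(x, np.array([0.5, 0.5]))  # average of current and previous sample
print(y)  # -> [0.5 1.5 2.5 3.5]
```

Stacking such layers (with dilation) grows the receptive field while keeping training parallel across time, which is the efficiency argument for dropping recurrence.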
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
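A normalizing flow, the construct Flowtron is built on, is an exactly invertible mapping between data and a latent space. The sketch below shows a single affine flow step with fixed toy parameters; in Flowtron the analogous parameters are predicted autoregressively by the network, which this sketch does not attempt to reproduce.

```python
import numpy as np

# Toy affine flow step: forward maps data to latent space, inverse
# recovers the data exactly. mu and sigma are fixed toy values here;
# a flow-based model predicts them with a neural network.

def flow_forward(x, mu, sigma):
    return (x - mu) / sigma

def flow_inverse(z, mu, sigma):
    return z * sigma + mu

mu, sigma = 1.0, 2.0
x = np.array([0.0, 1.0, 3.0])
z = flow_forward(x, mu, sigma)
x_rec = flow_inverse(z, mu, sigma)
print(np.allclose(x, x_rec))  # -> True: the mapping is exactly invertible
```

Exact invertibility is what enables the control described in the abstract: sampling or manipulating points in the latent space and mapping them back to speech.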
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
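Avoiding distillation means the generator needs another training signal besides the adversarial loss; Parallel WaveGAN combines the adversarial loss with spectral losses on the waveform. The sketch below illustrates one such spectral comparison at a single resolution; the FFT and hop sizes are illustrative choices, not the paper's configuration.

```python
import numpy as np

# Illustrative spectral-convergence loss: compare magnitude spectra of a
# reference and a generated waveform. Window sizes are toy values.

def magnitude_spectrogram(wave, n_fft=8, hop=4):
    frames = [wave[i:i + n_fft] for i in range(0, len(wave) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=-1))

def spectral_convergence(ref, gen):
    s_ref = magnitude_spectrogram(ref)
    s_gen = magnitude_spectrogram(gen)
    return np.linalg.norm(s_ref - s_gen) / np.linalg.norm(s_ref)

t = np.arange(32)
ref = np.sin(0.5 * t)
print(spectral_convergence(ref, ref))            # -> 0.0 for identical waveforms
print(spectral_convergence(ref, 0.5 * ref) > 0)  # -> True: mismatch is penalized
```

In the actual method this kind of loss is evaluated at multiple STFT resolutions and summed, which stabilizes GAN training without a teacher model.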
Towards end-to-end Japanese speech synthesis, we extend Tacotron with self-attention to capture long-term dependencies related to pitch accents, and compare the resulting systems' audio quality with classical pipeline systems under various conditions to show their pros and cons.
In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purposes.