Clone a voice in 5 seconds to generate arbitrary speech in real-time
SOTA for Text-To-Speech Synthesis on LJSpeech (using extra training data)
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
#4 best model for Speech Synthesis on North American English
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units.
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical pipeline systems under various conditions to show their pros and cons.
GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.