Speech synthesis is the task of generating speech from some other modality like text, lip movements etc.
Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk.
( Image credit: WaveNet: A generative model for raw audio )
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #4 on Speech Synthesis on North American English
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems.
With these data, three neural TTS models -- Tacotron2, Transformer and FastSpeech are applied for building bilingual and code-switched TTS.
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.
Ranked #1 on Speech Synthesis on North American English
Recent advances in neural-network based generative modeling of speech has shown great potential for speech coding.
Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e. g., pitch, energy and more accurate duration) as conditional inputs.