Speech synthesis is the task of generating speech from text.
Please note that the leaderboards here are not directly comparable between studies, as each study uses mean opinion score as its metric and collects its own sample of ratings from Amazon Mechanical Turk.
(Image credit: WaveNet: A generative model for raw audio)
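To illustrate the note on mean opinion scores above, here is a minimal sketch of how a mean opinion score (MOS) and an approximate 95% confidence interval might be computed from listener ratings. The function name `mean_opinion_score`, the 1-5 rating scale, the normal approximation, and the ratings themselves are assumptions made for illustration; they are not taken from any of the listed papers.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Normal approximation using the sample standard deviation (n - 1).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical 1-5 ratings of the same system from two different rater pools.
pool_a = [4, 5, 4, 4, 5, 4, 3, 5]
pool_b = [3, 4, 4, 3, 4, 3, 4, 3]

for name, pool in (("pool A", pool_a), ("pool B", pool_b)):
    mos, half_width = mean_opinion_score(pool)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

In this made-up example the same system receives noticeably different scores from the two pools, which is why MOS values gathered from different rater samples should be compared with caution.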
This paper investigates the differences occurring in the excitation for different voice qualities.
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices.
However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech.
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods.
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models.
This paper introduces attention forcing, which guides the model with generated output history and reference attention.
The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input.
For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.
The applicability of the DSM in two fields of speech processing is then studied.