Speech synthesis is the task of generating speech from text.
Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk.
( Image credit: WaveNet: A generative model for raw audio )
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
Jejueo was classified as critically endangered by UNESCO in 2010.
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques.
However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech.
In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously.
We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text).
A generative model that can synthesise multiple prosodies will, by design, not model average prosody.