Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

16 Dec 2017 · Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms...

TASK              DATASET                 MODEL                 METRIC NAME         METRIC VALUE  GLOBAL RANK
Speech Synthesis  North American English  Tacotron 2            Mean Opinion Score  4.526         # 1
Speech Synthesis  North American English  WaveNet (Linguistic)  Mean Opinion Score  4.341         # 2