Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

16 Dec 2017

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms...
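
The mel-scale spectrogram is the intermediate representation that links the two stages described above: the feature prediction network outputs it, and the vocoder consumes it. As a rough illustration of what that representation looks like, the sketch below computes a log-mel spectrogram from a waveform using only NumPy. It is not the paper's pipeline. The function names are hypothetical, the sample rate, FFT size, hop length, and channel count are illustrative assumptions (this summary does not state the paper's analysis settings), and the HTK-style mel formula is just one common convention.

```python
import numpy as np


def hz_to_mel(f):
    """Convert Hz to mel (HTK-style formula, one common convention)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Build triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fmax = sr / 2.0 if fmax is None else fmax
    # n_mels + 2 edge points: each filter spans three consecutive points.
    hz_points = mel_to_hz(
        np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    )
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Return a log-mel spectrogram of shape (n_frames, n_mels).

    All default parameter values are assumptions chosen for illustration.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    mel = magnitude @ mel_filterbank(sr, n_fft, n_mels).T
    # Clip before the log so silent frames do not produce -inf.
    return np.log(np.maximum(mel, 1e-5))
```

In a two-stage system of the kind the abstract describes, an array like the one returned by log_mel_spectrogram would serve as the training target for the first stage and as the conditioning input for the second.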

Evaluation results from the paper

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Speech Synthesis | North American English | Tacotron 2 | Mean Opinion Score | 4.526 | #1 |
| Speech Synthesis | North American English | WaveNet (Linguistic) | Mean Opinion Score | 4.341 | #2 |