Speech synthesis is the task of generating speech from text.
Note that the state-of-the-art tables here are not directly comparable across studies: they use mean opinion score (MOS) as the metric, and each study collects ratings from different pools of listeners on Amazon Mechanical Turk.
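Mean opinion score is simply the arithmetic mean of listener ratings on a 1–5 scale, usually reported with a confidence interval. A minimal sketch of the computation (the ratings list is illustrative, not from any of the papers below):

```python
import math

def mean_opinion_score(ratings):
    """Average listener ratings (1-5 scale) with a 95% confidence interval."""
    n = len(ratings)
    mos = sum(ratings) / n
    # Sample variance (n - 1 denominator)
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)
    # Normal-approximation 95% confidence interval on the mean
    ci = 1.96 * math.sqrt(var / n)
    return mos, ci

# Illustrative ratings from 10 listeners
mos, ci = mean_opinion_score([4, 5, 4, 3, 5, 4, 4, 3, 5, 4])
print(f"MOS = {mos:.2f} ± {ci:.2f}")  # MOS = 4.10 ± 0.46
```

Because the listener pool, rating instructions, and audio samples differ between studies, two MOS numbers from different papers are not on the same scale, which is why the tables above should be read per-study.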
| Trend | Dataset | Best Method | Paper title | Paper | Code | Compare |
|---|---|---|---|---|---|---|
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training.
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
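The staged pipeline described above can be sketched as three composed functions. The stubs below are hypothetical toy placeholders to show the data flow, not any particular system's implementation:

```python
def text_analysis_frontend(text):
    """Normalize text and emit a symbol sequence (toy stub: characters)."""
    return list(text.lower())

def acoustic_model(symbols):
    """Map symbols to acoustic feature frames, e.g. a mel-spectrogram.
    Here each symbol becomes a dummy one-element frame."""
    return [[float(ord(s))] for s in symbols]

def audio_synthesis(features):
    """Vocoder stage: turn acoustic features into waveform samples (toy stub)."""
    return [frame[0] / 128.0 for frame in features]

def tts(text):
    # The classic multi-stage pipeline: frontend -> acoustic model -> vocoder
    return audio_synthesis(acoustic_model(text_analysis_frontend(text)))

waveform = tts("Hello")
print(len(waveform))  # one output sample per input symbol in this toy sketch
```

End-to-end systems such as Tacotron-style models collapse the middle stages into a single learned network, leaving only a neural vocoder as a separate component.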
#4 best model for Speech Synthesis on North American English
In this paper we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms.
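A mel-spectrogram places FFT bins on the perceptual mel frequency scale. The standard HTK-style conversion formula (a general audio-processing fact, independent of WaveGlow itself) can be written as:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Mel filterbanks space band edges uniformly in mel, not in Hz,
# so low frequencies get finer resolution than high ones.
low, high = hz_to_mel(0.0), hz_to_mel(8000.0)
edges_hz = [mel_to_hz(low + i * (high - low) / 4) for i in range(5)]
print([round(e) for e in edges_hz])
```

The printed band edges crowd toward the low end of the spectrum, mirroring human pitch perception; a vocoder like WaveGlow learns to invert such a compressed representation back into a waveform.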
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system.
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training.
We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate.
We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS.