
Speech Synthesis

20 papers with code · Speech

Speech synthesis is the task of generating speech from text.

Please note that the state-of-the-art tables here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects a different sample of ratings from Amazon Mechanical Turk.
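To illustrate why such scores are hard to compare, here is a minimal Python sketch of how a mean opinion score and an approximate 95% confidence interval might be computed from listener ratings. The function name, the 1-5 rating scale, the normal-approximation interval, and the example ratings are all assumptions made for illustration; none of them come from the listed papers.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% interval).

    Assumes ratings on a 1-5 scale and a normal approximation, which is
    only a rough guide for small rater pools.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * sem

# Hypothetical ratings of the same audio from two different rater pools.
pool_a = [4, 5, 4, 4, 5, 3, 4, 5]
pool_b = [3, 4, 4, 3, 4, 3, 5, 4]
mos_a, ci_a = mean_opinion_score(pool_a)
mos_b, ci_b = mean_opinion_score(pool_b)
print(f"pool A: {mos_a:.2f} +/- {ci_a:.2f}")
print(f"pool B: {mos_b:.2f} +/- {ci_b:.2f}")
```

In this made-up example the two pools disagree by half a point on identical audio, which is the kind of variation that makes scores from separate studies difficult to rank against each other.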

State-of-the-art leaderboards

Greatest papers with code

WaveNet: A Generative Model for Raw Audio

12 Sep 2016 buriburisuri/speech-to-text-wavenet

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio.
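The autoregressive conditioning described in this abstract corresponds to the following factorization of the joint probability of a waveform, where $x_1, \ldots, x_T$ denote its audio samples (the notation is chosen here for illustration):

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```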

AUDIO GENERATION · SPEECH SYNTHESIS

Tacotron: Towards End-to-End Speech Synthesis

29 Mar 2017 keithito/tacotron

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices.

SPEECH SYNTHESIS · TEXT-TO-SPEECH SYNTHESIS

WaveGlow: A Flow-based Generative Network for Speech Synthesis

31 Oct 2018 NVIDIA/waveglow

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression.

SPEECH SYNTHESIS

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

ICLR 2018 r9y9/deepvoice3_pytorch

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.

SPEECH SYNTHESIS

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

16 Dec 2017 NVIDIA/tacotron2

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

SPEECH SYNTHESIS

Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq

25 May 2018 NVIDIA/OpenSeq2Seq

We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq give state-of-the-art performance at 1.5-3x less training time.

MACHINE TRANSLATION · SPEECH SYNTHESIS

Deep Voice: Real-time Neural Text-to-Speech

ICML 2017 NVIDIA/nv-wavenet

We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original.

BOUNDARY DETECTION · SPEECH SYNTHESIS

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

23 Sep 2017 r9y9/gantts

A GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
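The weighted-sum objective mentioned in this abstract can be sketched in Python as below. This is only an illustration under stated assumptions: the function names, the `adv_weight` parameter, and the negative-log form of the adversarial term are hypothetical choices, and the paper's exact weighting scheme is not reproduced here.

```python
import math

def adversarial_loss(disc_prob_natural):
    """Loss for deceiving the discriminator (assumed negative-log form).

    disc_prob_natural is the probability, in (0, 1], that the
    discriminator assigns to generated speech parameters being natural.
    The loss shrinks as the discriminator is fooled more often.
    """
    return -math.log(disc_prob_natural)

def acoustic_model_loss(generation_loss, disc_prob_natural, adv_weight=1.0):
    """Weighted sum of a conventional generation loss and an adversarial loss.

    adv_weight is a hypothetical hyperparameter; setting it to 0 recovers
    training with the conventional generation loss alone.
    """
    return generation_loss + adv_weight * adversarial_loss(disc_prob_natural)

# Made-up values: the same generation loss, with a discriminator that is
# either mostly fooled (0.9) or mostly not fooled (0.1).
print(acoustic_model_loss(2.0, 0.9))
print(acoustic_model_loss(2.0, 0.1))
```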

SPEECH SYNTHESIS

LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

28 Oct 2018 mozilla/LPCNet

These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS.
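For readers unfamiliar with the linear prediction named in the title, here is a minimal Python sketch of the general idea: each sample is predicted as a weighted sum of the preceding samples, and only the prediction residual remains to be modeled. The function names and the toy signal are assumptions for illustration; this is not code from the LPCNet repository.

```python
def linear_prediction(history, coeffs):
    """Predict the next sample as a weighted sum of the most recent samples.

    history: past samples, oldest first.
    coeffs:  predictor coefficients, where coeffs[0] weights the most
             recent sample.
    """
    order = len(coeffs)
    if len(history) < order:
        raise ValueError("history is shorter than the predictor order")
    recent = history[len(history) - order:][::-1]  # most recent first
    return sum(a * s for a, s in zip(coeffs, recent))

def prediction_residual(signal, coeffs):
    """Return signal[t] minus its linear prediction, for each t >= order."""
    order = len(coeffs)
    return [signal[t] - linear_prediction(signal[:t], coeffs)
            for t in range(order, len(signal))]

# Toy signal: a ramp satisfies s[t] = 2*s[t-1] - s[t-2] exactly, so a
# second-order predictor with these coefficients leaves no residual.
ramp = [0, 1, 2, 3, 4]
print(prediction_residual(ramp, [2, -1]))
```

When the predictor captures most of the signal's structure, the residual is small, which is the general motivation for pairing linear prediction with a smaller neural network.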

SPEECH SYNTHESIS

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

ICLR 2019 ksw0306/ClariNet

In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly-peaked output distributions.
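As background for the KL divergence mentioned in this abstract, the sketch below computes the standard closed-form KL divergence between two univariate Gaussians in Python. The function name and example values are assumptions for illustration, and the additional regularization used in the paper is not reproduced here.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Identical distributions have zero divergence.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))

# With highly peaked distributions (tiny standard deviations), even a
# small difference in means produces a large divergence.
print(gaussian_kl(0.01, 1e-3, 0.0, 1e-3))
```

The second call shows how the plain KL term grows quickly when both distributions are sharply peaked, which gives some intuition for why a regularized variant might be preferred in that regime.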

SPEECH SYNTHESIS