Speech Synthesis

174 papers with code • 3 benchmarks • 28 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects ratings from different samples of Amazon Mechanical Turk workers.
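MOS is simply the mean of subjective 1–5 listener ratings, so differences in rater pools and test sentences shift it between studies. A minimal sketch of how MOS and a rough 95% confidence interval are computed (function name and ratings are illustrative):

```python
import numpy as np

def mean_opinion_score(ratings):
    """MOS is the mean of 1-5 listener ratings; a confidence interval
    hints at how much it depends on the particular rater sample."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    ci = 1.96 * r.std(ddof=1) / np.sqrt(len(r))  # normal-approx 95% CI
    return mos, ci
```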

(Image credit: WaveNet: A Generative Model for Raw Audio)


Most implemented papers

WaveNet: A Generative Model for Raw Audio

ibab/tensorflow-wavenet 12 Sep 2016

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.
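WaveNet predicts each audio sample conditioned on all previous ones, using stacks of dilated causal convolutions so the receptive field grows exponentially with depth. A minimal numpy sketch of the core operation (function names and the tiny kernel are illustrative, not the paper's architecture):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], ...
    (kernel size = len(w)), so no future samples leak in."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])  # left-pad for causality
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Stacking layers with dilations 1, 2, 4, ... roughly doubles the
    receptive field per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 2 and dilations 1, 2, 4, 8 the receptive field is already 16 samples, which is why deep dilation stacks can cover the thousands of samples needed for raw audio.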

Tacotron: Towards End-to-End Speech Synthesis

CorentinJ/Real-Time-Voice-Cloning 29 Mar 2017

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

coqui-ai/TTS 16 Dec 2017

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

PaddlePaddle/PaddleSpeech ICLR 2021

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by (1) training the model directly on ground-truth targets instead of the simplified outputs of a teacher, and (2) introducing more variance information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs.
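The variance conditioning can be pictured as a length regulator that expands phoneme features by duration, plus pitch and energy embeddings looked up from quantised bins and added to the frames. A rough numpy sketch (bin edges, table sizes, and function names are assumptions, not the paper's exact configuration):

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Expand phoneme-level hidden states to frame level by repeating each
    phoneme's vector for its duration (ground-truth at training time)."""
    return np.repeat(phoneme_hidden, durations, axis=0)

def add_variance(frames, pitch, energy, pitch_table, energy_table):
    """Quantise frame-level pitch/energy into bins and add the looked-up
    embeddings -- the conditioning idea behind the variance adaptor."""
    p_idx = np.clip(np.digitize(pitch, np.linspace(0, 1, len(pitch_table))) - 1,
                    0, len(pitch_table) - 1)
    e_idx = np.clip(np.digitize(energy, np.linspace(0, 1, len(energy_table))) - 1,
                    0, len(energy_table) - 1)
    return frames + pitch_table[p_idx] + energy_table[e_idx]
```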

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

descriptinc/melgan-neurips NeurIPS 2019

In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques.

FastSpeech: Fast, Robust and Controllable Text to Speech

coqui-ai/TTS NeurIPS 2019

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.

Efficient Neural Audio Synthesis

CorentinJ/Real-Time-Voice-Cloning ICML 2018

The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.
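The sparsity is typically obtained by magnitude pruning: zeroing the smallest-magnitude weights until only a small fraction remains. A toy numpy version of one pruning step (function name illustrative; the paper applies this gradually during training):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w so that roughly a
    (1 - sparsity) fraction of weights survives."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)
```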

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

coqui-ai/TTS 25 Oct 2019

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
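The multi-resolution spectrogram objective compares generated and real audio as magnitude spectrograms at several FFT sizes, so the generator is penalised at multiple time-frequency trade-offs. A simplified numpy sketch (the resolutions and the plain L1 distance are illustrative; the paper combines spectral-convergence and log-magnitude terms):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram via windowed framing and a real FFT."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_resolution_stft_loss(pred, target,
                               resolutions=((256, 64), (512, 128), (1024, 256))):
    """Average L1 distance between magnitude spectrograms at several
    (n_fft, hop) resolutions."""
    losses = []
    for n_fft, hop in resolutions:
        p, t = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        losses.append(np.mean(np.abs(p - t)))
    return float(np.mean(losses))
```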

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

PaddlePaddle/PaddleSpeech ICML 2018

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.
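At synthesis time the style embedding is an attention-weighted combination of the learned tokens. A conceptual numpy sketch of that lookup (single-head dot-product attention here is a simplification of the paper's multi-head version; names are illustrative):

```python
import numpy as np

def style_embedding(ref_embedding, token_bank):
    """Attend over a learned bank of style tokens: softmax similarity
    between the reference encoding and each token, then a weighted sum."""
    scores = token_bank @ ref_embedding      # (num_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return weights @ token_bank              # combined style embedding
```

Because the tokens are trained without style labels, individual tokens often end up capturing interpretable factors (e.g. speaking rate), and weighting them manually gives direct style control.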