Speech Synthesis

286 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects ratings from a different pool of listeners (e.g. on Amazon Mechanical Turk).
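Since MOS is just an average of listener ratings, a minimal sketch of how a study might report it, with a normal-approximation 95% confidence interval (the ratings below are made-up example data):

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a 95% confidence interval.

    `ratings` is a list of listener scores on the usual 1-5 scale.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance (n - 1 denominator), then the CI half-width.
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

mean, hw = mos_with_ci([4, 5, 4, 3, 5, 4, 4])
print(f"MOS = {mean:.2f} +/- {hw:.2f}")
```

Because each study draws its own listener pool, two systems' MOS values from different papers can differ by more than either interval without one actually sounding better.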

(Image credit: WaveNet: A Generative Model for Raw Audio)


Most implemented papers

DiffWave: A Versatile Diffusion Model for Audio Synthesis

lmnt-com/diffwave ICLR 2021

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
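To illustrate the diffusion idea behind models like DiffWave, here is a toy NumPy sketch of the forward noising process and one reverse step; the schedule values and the "oracle" noise predictor are placeholders, since a real model learns a network eps_theta(x_t, t):

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)   # toy noise schedule, not DiffWave's
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 8 * np.pi, 200))  # stand-in "waveform"

def q_sample(x0, t):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps, eps

def p_step(x_t, t, eps_hat):
    """One reverse (denoising) step given a noise estimate eps_hat."""
    coef = betas[t] / np.sqrt(1 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

x_t, eps = q_sample(x0, T - 1)    # fully noised sample
x_prev = p_step(x_t, T - 1, eps)  # one reverse step with the true noise
```

Sampling runs this reverse step from t = T-1 down to 0; conditional vocoders like DiffWave additionally feed a mel-spectrogram into the noise predictor.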

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

jik876/hifi-gan NeurIPS 2020

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
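As a rough sketch of the adversarial objective in such GAN vocoders, here are least-squares GAN losses of the kind HiFi-GAN uses (the real system adds mel-spectrogram and feature-matching losses, and the discriminator scores below are placeholder arrays rather than network outputs):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator: push scores on real audio to 1, on fakes to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def g_loss(d_fake):
    """Generator: push discriminator scores on generated audio toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

# Placeholder discriminator outputs for a batch of two clips.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
print(d_loss(d_real, d_fake), g_loss(d_fake))
```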

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

r9y9/deepvoice3_pytorch ICLR 2018

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system.

WaveGrad: Estimating Gradients for Waveform Generation

coqui-ai/TTS ICLR 2021

This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.

Neural Speech Synthesis with Transformer Network

PaddlePaddle/PaddleSpeech 19 Sep 2018

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron 2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modelling long-range dependencies with current recurrent neural networks (RNNs).

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

huawei-noah/Speech-Backbones 13 May 2021

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential for modelling complex data distributions, while stochastic calculus has provided a unified view of these techniques, allowing for flexible inference schemes.

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

coqui-ai/TTS 15 Jun 2021

Using full-band mel-spectrograms as input, we aim to generate high-resolution signals by adding a discriminator that takes spectrograms of multiple resolutions as input.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

microsoft/unilm 5 Jan 2023

In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.

Neural Autoregressive Flows

CW-Huang/NAF ICML 2018

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time, via Inverse Autoregressive Flows (IAF).
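The affine transform at the core of an inverse autoregressive flow (IAF) step can be sketched as follows; the per-dimension shifts and log-scales are random placeholders here, whereas in a real IAF they come from an autoregressive network conditioned on the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(8)               # latent noise
mu = rng.standard_normal(8)              # placeholder for AR-network output
log_sigma = rng.standard_normal(8) * 0.1

# One IAF step: all dimensions are transformed in parallel,
# which is what makes IAF-based sampling fast.
x = mu + np.exp(log_sigma) * z
log_det = np.sum(log_sigma)              # log|det Jacobian| of the step

# Inversion recovers z exactly. In a real IAF this direction is
# sequential, since mu and log_sigma depend on earlier dims of x.
z_rec = (x - mu) / np.exp(log_sigma)
```

This parallel sampling direction is why IAF-style student networks (as in Parallel WaveNet) can synthesize audio much faster than autoregressive WaveNet sampling.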

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

ksw0306/ClariNet ICLR 2019

In this work, we propose a new solution for parallel wave generation by WaveNet.