About

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies, because each study uses mean opinion score (MOS) as its metric and collects different samples from Amazon Mechanical Turk.
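To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score might be computed from listener ratings. The function name `mean_opinion_score`, the 1-5 rating scale, the normal-approximation 95% interval, and the sample ratings are all illustrative assumptions, not taken from any of the listed papers.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    Assumes ratings on a 1-5 scale and a normal-approximation
    confidence interval (z=1.96 for roughly 95%); both are
    illustrative choices, not a standard mandated by the papers.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings from two studies with different listeners and samples.
study_a = [4, 5, 4, 4, 3, 5, 4, 4]
study_b = [5, 4, 5, 5, 4, 5, 4, 5]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Because the raters and the rated utterances differ between the two hypothetical studies, a gap between their scores may reflect the listening test as much as the model.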

(Image credit: WaveNet: A Generative Model for Raw Audio)

Greatest papers with code

Efficient Neural Audio Synthesis

ICML 2018 CorentinJ/Real-Time-Voice-Cloning

The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.

SPEECH SYNTHESIS TEXT-TO-SPEECH SYNTHESIS

Tacotron: Towards End-to-End Speech Synthesis

29 Mar 2017 CorentinJ/Real-Time-Voice-Cloning

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

SPEECH SYNTHESIS TEXT-TO-SPEECH SYNTHESIS

A Spectral Energy Distance for Parallel Speech Synthesis

NeurIPS 2020 google-research/google-research

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems.

SPEECH SYNTHESIS

WaveNet: A Generative Model for Raw Audio

12 Sep 2016 ibab/tensorflow-wavenet

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.

AUDIO GENERATION SPEECH SYNTHESIS

Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

16 Oct 2020 espnet/espnet

With these data, three neural TTS models (Tacotron2, Transformer and FastSpeech) are applied for building bilingual and code-switched TTS.

SPEECH SYNTHESIS VOICE CONVERSION

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

16 Dec 2017 NVIDIA/tacotron2

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.

SPEECH SYNTHESIS

Handling Background Noise in Neural Speech Generation

23 Feb 2021 google/lyra

Recent advances in neural-network-based generative modeling of speech have shown great potential for speech coding.

DENOISING SPEECH SYNTHESIS

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

NeurIPS 2020 TensorSpeech/TensorflowTTS

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.

SPEECH SYNTHESIS

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

ICLR 2021 TensorSpeech/TensorflowTTS

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from a teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.

KNOWLEDGE DISTILLATION SPEECH SYNTHESIS
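As a rough illustration of how a per-phoneme duration can act as a conditional input, the sketch below expands phoneme-level vectors to frame level by repeating each one for its duration. The function name `length_regulate` and the toy vectors are hypothetical, and this is a simplified picture rather than the FastSpeech 2 implementation, which also conditions on pitch and energy.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level vector for its duration in frames.

    phoneme_states: list of per-phoneme feature vectors (lists of floats).
    durations: list of non-negative frame counts, one per phoneme.
    Both inputs are toy stand-ins for model hidden states and
    predicted or ground-truth durations (an assumption for illustration).
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not share the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Three hypothetical phonemes lasting 2, 1 and 3 frames.
states = [[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]
frames = length_regulate(states, [2, 1, 3])
print(len(frames))  # number of output frames equals the sum of durations
```

A longer duration for a phoneme yields more output frames for it, which is one way the same text can map to differently timed speech.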