Text to Speech

340 papers with code • 1 benchmark • 1 dataset

Most implemented papers

WaveNet: A Generative Model for Raw Audio

ibab/tensorflow-wavenet 12 Sep 2016

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

coqui-ai/TTS ICLR 2021

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.

Tacotron: Towards End-to-End Speech Synthesis

CorentinJ/Real-Time-Voice-Cloning 29 Mar 2017

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

coqui-ai/TTS 24 Oct 2017

This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units.
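The "guided attention" in this paper's title refers to a penalty that pushes the text-to-audio attention matrix toward the diagonal. A minimal sketch in plain Python follows, assuming the commonly cited form W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2; the function names are hypothetical and are not taken from any listed repository.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Build an n_text x n_frames penalty matrix (assumed form, see lead-in).

    Entries are 0 where text position and audio frame are aligned on the
    diagonal and grow toward 1 as the pair moves away from it.
    """
    weights = []
    for n in range(n_text):
        row = []
        for t in range(n_frames):
            offset = n / n_text - t / n_frames
            row.append(1.0 - math.exp(-(offset ** 2) / (2.0 * g ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention * penalty over all (text position, frame) pairs."""
    n_text = len(attention)
    n_frames = len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = 0.0
    for n in range(n_text):
        for t in range(n_frames):
            total += attention[n][t] * weights[n][t]
    return total / (n_text * n_frames)


# Hypothetical toy inputs: a perfectly diagonal alignment and a reversed one.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]
```

Under these assumptions a diagonal alignment incurs no penalty, while an alignment that reads the text backwards is penalized.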

FastSpeech: Fast, Robust and Controllable Text to Speech

coqui-ai/TTS NeurIPS 2019

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
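Generating a mel-spectrogram in parallel requires knowing in advance how many frames each phoneme spans. A minimal sketch of a length-regulator step in plain Python follows, assuming per-phoneme durations are already available; the function name and the speed factor `alpha` are illustrative choices here, not an API from any listed repository.

```python
def length_regulate(phoneme_states, durations, alpha=1.0):
    """Expand each phoneme state to cover its number of output frames.

    phoneme_states: one item per phoneme (any object, e.g. a hidden vector).
    durations: frames per phoneme, same length as phoneme_states.
    alpha: assumed speed factor; values above 1 lengthen (slow down) the
           output, values below 1 shorten (speed up) it.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("phoneme_states and durations must have equal length")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        # Note: Python's round() rounds exact halves to the nearest even integer.
        repeats = max(0, int(round(duration * alpha)))
        frames.extend([state] * repeats)
    return frames


# Hypothetical toy input: three phonemes lasting 2, 1 and 3 frames.
expanded = length_regulate(["a", "b", "c"], [2, 1, 3])
```

Because the expanded sequence has one entry per output frame, every frame can be decoded at once rather than one step at a time.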

Efficient Neural Audio Synthesis

CorentinJ/Real-Time-Voice-Cloning ICML 2018

The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

coqui-ai/TTS 25 Oct 2019

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.

FastSpeech: Fast, Robust and Controllable Text to Speech

PaddlePaddle/PaddleSpeech 22 May 2019

Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).

Robust universal neural vocoding

bshall/ZeroSpeech 15 Nov 2018

This paper introduces a robust universal neural vocoder trained with 74 speakers (comprised of both genders) coming from 17 languages.