Text-To-Speech Synthesis
97 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
Most implemented papers
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
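The role of duration as a conditional input can be illustrated with a small sketch. It assumes the FastSpeech-style "length regulator" idea, where each phoneme-level hidden state is repeated for as many frames as its duration. The function name `length_regulate` and the `speed` parameter are hypothetical names chosen for illustration, and pitch and energy conditioning are left out.

```python
import numpy as np

def length_regulate(hidden, durations, speed=1.0):
    """Expand phoneme-level states to frame level by repeating each row.

    hidden:    array of shape (n_phonemes, dim)
    durations: number of frames per phoneme
    speed:     hypothetical control; values > 1 shorten the output
    """
    hidden = np.asarray(hidden)
    # Scale durations for speed control, then round to whole frames.
    frames = np.maximum(np.rint(np.asarray(durations) / speed), 0).astype(int)
    return np.repeat(hidden, frames, axis=0)

# Three phonemes with 2-dim hidden states and durations of 2, 1 and 3 frames.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expanded = length_regulate(h, [2, 1, 3])
print(expanded.shape)  # (6, 2)
```

Because the expansion is a plain repeat, all output frames can then be decoded in parallel; a real model would predict the durations rather than take them as given.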
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units.
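The "guided attention" in the title can be sketched as a penalty on attention mass that lies far from the diagonal of the text-to-frame alignment. The snippet below is a minimal illustration under that assumption; the weighting form, the default `g=0.2`, and the function names are not taken from the summary above and should be read as assumptions.

```python
import numpy as np

def guided_attention_penalty(n_text, n_frames, g=0.2):
    """Penalty matrix: near 0 on the diagonal, approaching 1 far from it."""
    n = np.arange(n_text)[:, None] / n_text      # normalized text position
    t = np.arange(n_frames)[None, :] / n_frames  # normalized frame position
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(attention, g=0.2):
    """Mean of the attention weights multiplied by the off-diagonal penalty."""
    attention = np.asarray(attention)
    w = guided_attention_penalty(attention.shape[0], attention.shape[1], g=g)
    return float(np.mean(attention * w))

diagonal = np.eye(4)                # perfectly monotonic alignment
reversed_ = np.fliplr(np.eye(4))    # alignment running the wrong way
print(guided_attention_loss(diagonal), guided_attention_loss(reversed_))
```

A diagonal alignment incurs no penalty, while a reversed one is penalized, which is the pressure that nudges attention toward a roughly monotonic alignment during training.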
FastSpeech: Fast, Robust and Controllable Text to Speech
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
Efficient Neural Audio Synthesis
The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
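The "multi-resolution spectrogram" in the title refers to comparing generated and target waveforms through STFT magnitudes at several analysis resolutions. The following NumPy sketch shows one way such a loss can be computed, combining a spectral convergence term and a log-magnitude term; the FFT sizes, hop sizes, and function names are illustrative assumptions rather than the paper's settings, and the adversarial part of the training objective is omitted.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Small constant keeps the logarithm below finite.
    return np.abs(np.fft.rfft(frames, axis=1)) + 1e-7

def stft_loss(pred, target, n_fft, hop):
    """Spectral convergence plus mean absolute log-magnitude difference."""
    p = stft_magnitude(pred, n_fft, hop)
    t = stft_magnitude(target, n_fft, hop)
    spectral_convergence = np.linalg.norm(t - p) / np.linalg.norm(t)
    log_magnitude = np.mean(np.abs(np.log(t) - np.log(p)))
    return spectral_convergence + log_magnitude

def multi_resolution_stft_loss(pred, target,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average the single-resolution loss over several (n_fft, hop) pairs."""
    return float(np.mean([stft_loss(pred, target, n, h) for n, h in resolutions]))

rng = np.random.default_rng(0)
target = rng.standard_normal(4096)
noisy = target + 0.1 * rng.standard_normal(4096)
print(multi_resolution_stft_loss(noisy, target))
```

Using several resolutions trades off time and frequency precision, so the generator cannot satisfy the loss by matching only one analysis setting. Inputs must be at least as long as the largest FFT size.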
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.
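The idea of a bank of embeddings can be sketched as attention over a small set of token vectors: a reference vector is scored against every token, and the style embedding is the weighted sum of the tokens. This is a simplified single-head, dot-product illustration; the function name, the shapes, and the random vectors are assumptions, and a trained system would learn the tokens and derive the reference vector from audio.

```python
import numpy as np

def style_embedding(reference, tokens):
    """Attend over a token bank and return (embedding, attention weights).

    reference: vector of shape (dim,), e.g. a summary of reference audio
    tokens:    token bank of shape (n_tokens, dim)
    """
    scores = tokens @ reference / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ tokens, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 8))    # 10 hypothetical style tokens
reference = rng.standard_normal(8)
embedding, weights = style_embedding(reference, tokens)
print(embedding.shape, weights.sum())
```

Style control then amounts to choosing the weights by hand instead of computing them from a reference, for example by setting a single weight to 1 to condition on one token.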
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Clone a voice in 5 seconds to generate arbitrary speech in real-time
FastSpeech: Fast, Robust and Controllable Text-to-Speech
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.