Text-To-Speech Synthesis

90 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
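Most TTS systems start with a text-analysis frontend that normalizes the input and maps it to phonemes before any acoustic modelling happens. The sketch below is a toy illustration of that frontend stage; the phoneme table and number expansion are tiny made-up examples, not a real lexicon.

```python
# Toy sketch of a TTS text frontend: normalize text, then map words to
# phonemes. The phoneme table is a small illustrative subset, not a real
# pronunciation lexicon such as CMUdict.

import re

PHONEME_TABLE = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "two": ["T", "UW"],
}

NUMBER_WORDS = {"2": "two"}  # toy text normalization: expand digits to words

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and expand digits to words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [NUMBER_WORDS.get(w, w) for w in words]

def to_phonemes(words: list[str]) -> list[str]:
    """Look each word up in the lexicon; unknown words fall back to letters."""
    phonemes = []
    for w in words:
        phonemes.extend(PHONEME_TABLE.get(w, list(w.upper())))
    return phonemes

print(to_phonemes(normalize("Hello, world!")))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

A real system would replace the lookup table with a grapheme-to-phoneme model and pass the phoneme sequence to an acoustic model such as those listed below.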

Most implemented papers

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

MoonInTheRiver/DiffSinger 6 May 2021

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.

Neural Speech Synthesis with Transformer Network

PaddlePaddle/PaddleSpeech 19 Sep 2018

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs).
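The Transformer sidesteps both problems because its core operation, scaled dot-product self-attention, relates every timestep to every other in a single matrix product instead of a sequential recurrence. A minimal sketch, with illustrative shapes and no learned projection matrices:

```python
# Minimal sketch of scaled dot-product self-attention, the core of
# Transformer TTS. All timesteps are processed in parallel by one matrix
# product, unlike an RNN's step-by-step recurrence. For simplicity the
# input is used directly as queries, keys, and values (no learned weights).

import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model). Returns attention output of the same shape."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ x                               # weighted sum of values

x = np.random.default_rng(0).standard_normal((5, 8))
out = self_attention(x)
print(out.shape)  # (5, 8)
```

Because attention distance between any two positions is constant, long-range dependencies are no harder to model than short-range ones.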

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

huawei-noah/Speech-Backbones 13 May 2021

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes.
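The idea underlying such models is a forward process that gradually corrupts data with Gaussian noise; the network is trained to reverse it. A toy illustration of the closed-form forward step, with a made-up noise schedule standing in for the one used in Grad-TTS:

```python
# Toy illustration of the forward (noising) process behind denoising
# diffusion models such as Grad-TTS: data is progressively mixed with
# Gaussian noise. The schedule below is illustrative, not from the paper.

import numpy as np

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))   # stand-in for one mel frame

betas = np.linspace(1e-4, 0.2, 50)           # toy noise schedule
alpha_bars = np.cumprod(1.0 - betas)         # cumulative signal retention

def noised(x0: np.ndarray, t: int) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

for t in (0, 24, 49):
    print(f"t={t:2d}  signal scale={np.sqrt(alpha_bars[t]):.3f}")
```

As `t` grows the signal scale shrinks toward zero, so the sample approaches pure noise; generation runs this corruption in reverse, guided by the learned score.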

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

microsoft/unilm 5 Jan 2023

In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.

Exploring Transfer Learning for Low Resource Emotional TTS

Emotional-Text-to-Speech/dl-for-emo-tts Advances in Intelligent Systems and Computing 2019

During the last few years, spoken language technologies have seen major improvements thanks to Deep Learning.

MelNet: A Generative Model for Audio in the Frequency Domain

fatchord/MelNet 4 Jun 2019

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.
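Working in the frequency domain sidesteps this: a short-time Fourier transform turns tens of thousands of waveform samples into a far shorter sequence of spectral frames. A sketch with illustrative frame sizes (magnitude spectrogram only, without the mel warping MelNet applies):

```python
# Sketch of the time-frequency view a model like MelNet operates on:
# windowed FFT frames turn a 1-D waveform into a 2-D grid with far fewer
# timesteps. Frame and hop sizes here are illustrative choices.

import numpy as np

def stft_magnitude(wave: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram via windowed FFT frames (no mel warping)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(wave) - frame) // hop
    frames = np.stack([wave[i * hop : i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, frame // 2 + 1)

sr = 16000
t = np.arange(sr) / sr                 # one second of audio
wave = np.sin(2 * np.pi * 440 * t)     # a 440 Hz tone
spec = stft_magnitude(wave)
print(spec.shape)                      # 124 frames instead of 16000 samples
```

One second of audio collapses from 16,000 raw samples to about a hundred frames, which is what makes long-range structure tractable for an autoregressive model.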

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

jaywalnut310/glow-tts NeurIPS 2020

By leveraging the properties of flows, MAS searches for the most probable monotonic alignment between text and the latent representation of speech.
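Monotonic alignment search is a Viterbi-style dynamic program. The toy version below captures its spirit: each speech frame is assigned to a text token, the token index never moves backwards and advances by at most one per frame, and the total alignment score is maximized. The score matrix is made up for illustration, and this simplification omits details of the actual Glow-TTS implementation.

```python
# Toy dynamic program in the spirit of monotonic alignment search (MAS):
# given a log-score for aligning each text token i to each speech frame j,
# find the best alignment in which the token index is non-decreasing and
# advances by at most one per frame. Scores are illustrative.

import numpy as np

def monotonic_align(log_probs: np.ndarray) -> list[int]:
    """log_probs: (n_tokens, n_frames). Returns the token index per frame."""
    T, F = log_probs.shape
    Q = np.full((T, F), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, F):
        for i in range(min(j + 1, T)):       # token i is unreachable before frame i
            stay = Q[i, j - 1]               # keep the same token
            step = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance one token
            Q[i, j] = max(stay, step) + log_probs[i, j]
    # Backtrack from the last token at the last frame.
    path, i = [], T - 1
    for j in range(F - 1, -1, -1):
        path.append(i)
        if i > 0 and (j == i or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return path[::-1]

scores = np.log(np.array([[0.9, 0.8, 0.1, 0.1],
                          [0.1, 0.2, 0.9, 0.8]]))
print(monotonic_align(scores))  # [0, 0, 1, 1]
```

In Glow-TTS the per-cell scores come from the likelihood of each latent speech frame under each text token's prior, and the resulting hard alignment also supplies durations for parallel inference.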

Tools and resources for Romanian text-to-speech and speech-to-text applications

racai-ai/TEPROLIN 15 Feb 2018

In this paper we introduce a set of resources and tools aimed at providing support for natural language processing, text-to-speech synthesis and speech recognition for Romanian.

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

NVIDIA/flowtron ICLR 2021

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

maum-ai/wavegrad2 17 Jun 2021

The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.