Text-To-Speech Synthesis
92 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
Libraries
Use these libraries to find Text-To-Speech Synthesis models and implementationsDatasets
Latest papers with no code
Guided Flows for Generative Modeling and Decision Making
Classifier-free guidance is a key component for enhancing the performance of conditional generative models across diverse tasks.
Generative Pre-training for Speech with Flow Matching
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data.
Unified speech and gesture synthesis using flow matching
As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures.
The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.
DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis.
The FruitShell French synthesis system at the Blizzard 2023 Challenge
The evaluation results of our system showed a quality MOS score of 3. 6 for the Hub task and 3. 4 for the Spoke task, placing our system at an average level among all participating teams.
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style.
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features.
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space.
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs
In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement.