Text-To-Speech Synthesis
92 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
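Most neural TTS systems start with a text frontend that normalizes the input and maps it to integer symbol IDs for the acoustic model. The sketch below illustrates that first stage; the symbol set and normalization rules are simplified assumptions, not any specific system's.

```python
# Minimal sketch of a typical neural-TTS text frontend (illustrative):
# normalize the input text, then map characters to integer IDs that an
# acoustic model could consume downstream.
import re

SYMBOLS = list("abcdefghijklmnopqrstuvwxyz ,.?!'")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop unsupported characters."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return "".join(ch for ch in text if ch in SYMBOL_TO_ID)

def encode(text: str) -> list[int]:
    """Convert normalized text to a sequence of symbol IDs."""
    return [SYMBOL_TO_ID[ch] for ch in normalize(text)]

ids = encode("Hello,  world!")
```

Real frontends add steps such as number expansion and grapheme-to-phoneme conversion, but the output is the same kind of ID sequence.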
Libraries
Use these libraries to find Text-To-Speech Synthesis models and implementations.
Datasets
Latest papers
Matcha-TTS: A fast TTS architecture with conditional flow matching
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM).
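In the basic OT-CFM formulation, a training example is a point on the straight path between a noise sample x0 and a data sample x1, and the network regresses the constant velocity x1 - x0. A minimal sketch of that target construction (simplified: the full objective also blends in a small sigma_min, and the regressed vector field is a neural network, omitted here):

```python
# Illustrative sketch of the OT-CFM training target: points on the straight
# path x_t = (1 - t) * x0 + t * x1 have constant target velocity x1 - x0.
import numpy as np

def ot_cfm_pair(x1: np.ndarray, t: float, rng) -> tuple[np.ndarray, np.ndarray]:
    """Return (x_t, target_velocity) for one data sample x1 at time t."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    v_target = x1 - x0                   # constant OT velocity
    return x_t, v_target

rng = np.random.default_rng(0)
x1 = rng.standard_normal(4)              # stand-in for a mel-spectrogram frame
x_t, v = ot_cfm_pair(x1, t=0.3, rng=rng)
```

Straight paths are why flow-matching models like this can synthesize in few ODE-solver steps: following v from x_t for the remaining time (1 - t) lands exactly on x1.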
Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context.
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration
This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek.
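Transliterating every language into one shared script lets a single multilingual model reuse the same symbol inventory. A toy illustration of the idea, with a hypothetical mapping table covering only a few Kazakh Cyrillic letters (not the paper's actual scheme):

```python
# Toy script-transliteration step for a multilingual TTS frontend: map
# Cyrillic characters to a shared Latin representation. The table below is
# illustrative only and covers just a handful of Kazakh letters.
CYR_TO_LAT = {
    "с": "s", "ә": "a", "л": "l", "е": "e", "м": "m", " ": " ",
}

def transliterate(text: str) -> str:
    """Replace known Cyrillic characters; pass others through unchanged."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())

latin = transliterate("Сәлем")   # a Kazakh greeting
```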
Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert
Keywords: Bark, AI voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's Encodec, audio codebooks, semantic tokens, HuBERT, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pretrained model checkpoints
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
This is the first time that face images are used as a condition to train a TTS model.
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt during synthesis.
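VALL-E's core idea is to treat TTS as language modeling over discrete audio codec tokens: speech is quantized into token sequences, and generation is next-token prediction. Purely as an illustration, a bigram count model below stands in for the transformer:

```python
# Toy sketch of the "codec language model" idea: generation is next-token
# prediction over discrete codec tokens. A bigram count model replaces the
# transformer here; this is illustrative, not VALL-E's architecture.
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count next-token frequencies for each token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def greedy_continue(counts, prompt, n):
    """Greedily extend a codec-token prompt by up to n tokens."""
    out = list(prompt)
    for _ in range(n):
        nxt = counts[out[-1]].most_common(1)
        if not nxt:
            break
        out.append(nxt[0][0])
    return out

# Toy "codec token" sequences (integers from a small codebook).
data = [[1, 2, 3, 1, 2, 3], [2, 3, 1, 2]]
model = train_bigram(data)
tokens = greedy_continue(model, [1], 3)
```

In the real system, the prompt would also contain phoneme tokens and an enrolled speaker's codec tokens, which is what enables zero-shot voice cloning.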
RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
With the advent of deep learning, many text-to-speech (TTS) models that produce human-like speech have emerged.