Text-To-Speech Synthesis

92 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that converts written text into spoken audio. The goal is to generate synthetic speech that sounds as natural and as close to human speech as possible.


Matcha-TTS: A fast TTS architecture with conditional flow matching

shivammehta25/Matcha-TTS 6 Sep 2023

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM).

381 stars
06 Sep 2023

Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

choijeongsoo/utut 3 Aug 2023

A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).

15 stars
03 Aug 2023

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

lucidrains/voicebox-pytorch NeurIPS 2023

Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context.

503 stars
23 Jun 2023

Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration

is2ai/turkictts 25 May 2023

This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek.

38 stars
25 May 2023

Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert

serp-ai/bark-with-voice-clone Social Science Research Network (SSRN) 2023

Keywords: Bark, AI voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's EnCodec, audio codebooks, semantic tokens, HuBERT, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pretrained model checkpoints

2,798 stars
18 Apr 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

plachtaa/vall-e-x 7 Mar 2023

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.

7,138 stars
07 Mar 2023

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

naver-ai/facetts 27 Feb 2023

This is the first time that face images are used as a condition to train a TTS model.

43 stars
27 Feb 2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

b04901014/mqtts 8 Feb 2023

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness.

224 stars
08 Feb 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

suno-ai/bark 5 Jan 2023

In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.

32,377 stars
05 Jan 2023

RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis

shinhyeokoh/rwen 15 Dec 2022

With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged.

14 stars
15 Dec 2022