Speech Synthesis

290 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies, as they use mean opinion score as the metric and each study collects a different sample from Amazon Mechanical Turk.
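To illustrate why mean opinion scores from separate studies are hard to compare, here is a minimal sketch of how such a score is typically computed: listeners rate each sample on a 1 to 5 scale, and the ratings are averaged. The function name `mean_opinion_score`, the normal-approximation confidence interval, and the ratings below are illustrative assumptions, not data or code from any study listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings are independent scores on a 1-5 scale and uses a
    normal approximation for the interval (an illustrative choice).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical ratings from two separate listener pools.
study_a = [4, 5, 4, 4, 3, 5, 4, 4]
study_b = [3, 4, 4, 3, 4, 3, 5, 3]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} ± {ci:.2f}")
```

Because each score depends on which listeners and which audio samples were drawn, a gap between two studies can reflect the rating setup rather than the models themselves.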

(Image credit: WaveNet: A generative model for raw audio)

Benchmarks

These leaderboards are used to track progress in Speech Synthesis

Dataset	Best Model
LibriTTS	EVA-GAN-big
North American English	(none listed)
LJSpeech	BDDM vocoder
Mandarin Chinese	WaveNet (L+F)

Libraries

Use these libraries to find Speech Synthesis models and implementations

Library	Papers	GitHub stars
coqui-ai/TTS	15	29,239
PaddlePaddle/PaddleSpeech	15	10,142
TensorSpeech/TensorflowTTS	6	3,701
keonlee9420/Expressive-FastSpeech2	4	259

See all 22 libraries.

Datasets

Subtasks

Speech Synthesis - Tamil

Speech Synthesis - Kannada

Speech Synthesis - Malayalam

Speech Synthesis - Telugu

Speech Synthesis - Assamese

Speech Synthesis - Bengali

Speech Synthesis - Bodo

Speech Synthesis - Gujarati

Speech Synthesis - Hindi

Speech Synthesis - Manipuri

Speech Synthesis - Marathi

Speech Synthesis - Rajasthani

Latest papers with no code

FlashSpeech: Efficient Zero-Shot Speech Synthesis

no code yet • 23 Apr 2024

The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation.

Retrieval-Augmented Audio Deepfake Detection

no code yet • 22 Apr 2024

With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse.

Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications

no code yet • 21 Apr 2024

The rise of deep learning has marked significant progress in fields such as computer vision, natural language processing, and medical imaging, primarily through the adaptation of pre-trained models for specific tasks.

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness

no code yet • 10 Apr 2024

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes.

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code yet • 4 Apr 2024

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.

PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders

no code yet • 3 Apr 2024

Neural speech codec has recently gained widespread attention in generative speech modeling domains, like voice conversion, text-to-speech synthesis, etc.

Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

no code yet • 3 Apr 2024

Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech.

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

no code yet • 1 Apr 2024

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis.

Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

no code yet • 25 Mar 2024

A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics.

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

no code yet • 21 Mar 2024

Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.
