Speech Synthesis

290 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from some other modality like text, lip movements etc.

Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk.

( Image credit: WaveNet: A generative model for raw audio )

Libraries

Use these libraries to find Speech Synthesis models and implementations

Latest papers with no code

FlashSpeech: Efficient Zero-Shot Speech Synthesis

no code yet • 23 Apr 2024

The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation.

Retrieval-Augmented Audio Deepfake Detection

no code yet • 22 Apr 2024

With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse.

Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications

no code yet • 21 Apr 2024

The rise of deep learning has marked significant progress in fields such as computer vision, natural language processing, and medical imaging, primarily through the adaptation of pre-trained models for specific tasks.

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness

no code yet • 10 Apr 2024

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes.

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code yet • 4 Apr 2024

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.

PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders

no code yet • 3 Apr 2024

Neural speech codec has recently gained widespread attention in generative speech modeling domains, like voice conversion, text-to-speech synthesis, etc.

Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

no code yet • 3 Apr 2024

Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech.

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

no code yet • 1 Apr 2024

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis.

Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

no code yet • 25 Mar 2024

A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics.

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

no code yet • 21 Mar 2024

Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.