Speech Synthesis

290 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.
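As a quick illustration of the text-to-speech case, the sketch below uses the open-source Coqui TTS package with one of its pretrained English models. The package, model name, and API calls are assumptions about a particular library version, not something taken from this page.

```python
# Minimal text-to-speech sketch (assumes the Coqui `TTS` package is installed,
# e.g. `pip install TTS`; the model name and API may differ between versions).
from TTS.api import TTS

# Load a pretrained single-speaker English model (hypothetical choice for illustration).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize a waveform from text and write it to disk.
tts.tts_to_file(text="Speech synthesis turns text into audio.", file_path="sample.wav")
```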

Please note that the leaderboards here are not directly comparable across studies: they use mean opinion score (MOS) as the metric, and each study collects ratings from a different pool of Amazon Mechanical Turk listeners.
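For reference, MOS is just the arithmetic mean of listeners' 1-5 ratings, usually reported with a 95% confidence interval. The snippet below is a minimal sketch using made-up ratings; the variable names and data are hypothetical.

```python
import math
import statistics

# Hypothetical 1-5 naturalness ratings collected for one system (made-up data).
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 5]

mos = statistics.mean(ratings)              # mean opinion score
sd = statistics.stdev(ratings)              # sample standard deviation
ci95 = 1.96 * sd / math.sqrt(len(ratings))  # normal-approximation 95% confidence interval

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```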

(Image credit: WaveNet: A Generative Model for Raw Audio)

Latest papers with no code

Towards Accurate Lip-to-Speech Synthesis in-the-Wild

no code yet • 2 Mar 2024

In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.

VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis

no code yet • 1 Mar 2024

This forces the model to learn a speaker distribution disentangled from the semantic content.

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

no code yet • 29 Feb 2024

Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth).

Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting

no code yet • 19 Feb 2024

Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker factored approximations produces a better preservation of the pre-training knowledge than the diagonal ones.

Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

no code yet • 16 Feb 2024

Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks.

Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

no code yet • 11 Feb 2024

This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker.

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

no code yet • 31 Jan 2024

Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.

EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

no code yet • 31 Jan 2024

The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns.

SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

no code yet • 30 Jan 2024

Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation.

MunTTS: A Text-to-Speech System for Mundari

no code yet • 28 Jan 2024

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family.