Speech Synthesis
290 papers with code • 4 benchmarks • 19 datasets
Speech synthesis is the task of generating speech from another modality, such as text or lip movements.
Please note that the leaderboards here are not directly comparable between studies, as they use mean opinion score (MOS) as the metric and collect ratings from different listener pools on Amazon Mechanical Turk.
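To make the comparability caveat concrete: MOS is the mean of subjective 1–5 listener ratings, so scores from different listener pools carry different biases and variances. A minimal sketch of computing MOS with an approximate 95% confidence interval (the ratings shown are hypothetical, not from any paper on this page):

```python
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score (1-5 scale) with an approximate 95% confidence interval."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    # Normal-approximation half-width of the confidence interval
    ci = z * statistics.stdev(ratings) / n ** 0.5
    return mean, ci

# Hypothetical ratings from one study's listener pool:
mean, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])
print(f"MOS = {mean:.2f} +/- {ci:.2f}")  # MOS = 4.12 +/- 0.44
```

Because the interval depends on the listener pool's variance and size, two systems with the same reported MOS from different studies may not be statistically equivalent.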
(Image credit: WaveNet: A Generative Model for Raw Audio)
Libraries
Use these libraries to find Speech Synthesis models and implementations.
Subtasks
- Expressive Speech Synthesis
- Emotional Speech Synthesis
- Text-to-Speech Translation
- Speech Synthesis - Tamil
- Speech Synthesis - Kannada
- Speech Synthesis - Malayalam
- Speech Synthesis - Telugu
- Speech Synthesis - Assamese
- Speech Synthesis - Bengali
- Speech Synthesis - Bodo
- Speech Synthesis - Gujarati
- Speech Synthesis - Hindi
- Speech Synthesis - Manipuri
- Speech Synthesis - Marathi
- Speech Synthesis - Rajasthani
Latest papers with no code
Towards Accurate Lip-to-Speech Synthesis in-the-Wild
In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
This forces the model to learn a speaker distribution disentangled from the semantic content.
Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth).
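The character error rate (CER) cited above is the Levenshtein edit distance between the recognized transcript and the reference, divided by the reference length. A minimal sketch (the example strings are illustrative, not from the paper):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edits / reference length."""
    # One-row dynamic-programming edit-distance computation
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, 1):
        cur = [i]
        for j, hc in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / len(reference)

print(cer("speech synthesis", "speach synthesis"))  # one substitution -> 0.0625
```

A "CER difference of <10% to ground truth" therefore means the synthesized speech is transcribed nearly as accurately as real recordings by the same recognizer.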
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting
Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker factored approximations produces a better preservation of the pre-training knowledge than the diagonal ones.
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks.
Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker.
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks
The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns.
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation.
MunTTS: A Text-to-Speech System for Mundari
We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family.