Speech Synthesis
286 papers with code • 4 benchmarks • 19 datasets
Speech synthesis is the task of generating speech from another modality, such as text or lip movements.
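For text-to-speech, a minimal inference sketch might look like the following. This assumes the open-source Coqui TTS package and one of its pretrained English model names; neither is tied to any specific paper listed below.

```python
# Minimal text-to-speech sketch using the open-source Coqui TTS package.
# Assumes Coqui TTS is installed (pip install TTS) and that the
# "tts_models/en/ljspeech/tacotron2-DDC" pretrained model is available.
from TTS.api import TTS

# Load a pretrained Tacotron 2 model trained on the LJSpeech dataset.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize speech from text and write the waveform to a WAV file.
tts.tts_to_file(text="Speech synthesis generates audio from text.",
                file_path="output.wav")
```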
Please note that the leaderboards here are not directly comparable across studies: they use mean opinion score (MOS) as the metric, and each study collects ratings from a different sample of Amazon Mechanical Turk listeners.
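For reference, MOS is simply the mean of 1-5 listener ratings, usually reported with a confidence interval. A minimal sketch, using made-up ratings:

```python
import math

# Hypothetical 1-5 listener ratings for one system's audio samples.
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 4, 5]

n = len(ratings)
mos = sum(ratings) / n  # mean opinion score

# Sample standard deviation and a normal-approximation 95% confidence interval.
std = math.sqrt(sum((r - mos) ** 2 for r in ratings) / (n - 1))
ci95 = 1.96 * std / math.sqrt(n)

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```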
(Image credit: WaveNet: A generative model for raw audio)
Libraries
Use these libraries to find Speech Synthesis models and implementations
Subtasks
- Expressive Speech Synthesis
- Emotional Speech Synthesis
- Text-to-Speech Translation
- Speech Synthesis - Tamil
- Speech Synthesis - Kannada
- Speech Synthesis - Malayalam
- Speech Synthesis - Telugu
- Speech Synthesis - Assamese
- Speech Synthesis - Bengali
- Speech Synthesis - Bodo
- Speech Synthesis - Gujarati
- Speech Synthesis - Hindi
- Speech Synthesis - Manipuri
- Speech Synthesis - Marathi
- Speech Synthesis - Rajasthani
Latest papers
Towards Decoding Brain Activity During Passive Listening of Speech
The aim of this study is to investigate the complex mechanisms of speech perception and, ultimately, to decode the electrical changes that occur in the brain while listening to speech.
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection
The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms.
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism
Our method, which we call NEUral Text to ARticulate Talk (NEUTART), is a talking face generator that uses a joint audiovisual feature space, as well as speech-informed 3D facial reconstructions and a lip-reading loss for visual supervision.
Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech
In this work, we propose to learn the AV representation from categorical emotion labels of speech.
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra
APNet demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed.
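To illustrate the idea of direct amplitude-and-phase prediction: once a vocoder outputs both spectra, the waveform follows from a single inverse STFT. A sketch under assumed shapes and STFT settings (not APNet2's actual configuration), with random tensors standing in for network outputs:

```python
import torch

# Assumed STFT settings; APNet2's real configuration may differ.
n_fft, hop, win = 1024, 256, 1024
frames = 200

# Random tensors stand in for the vocoder's predicted spectra.
amplitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = (torch.rand(1, n_fft // 2 + 1, frames) * 2 - 1) * torch.pi

# Combine into a complex spectrogram and invert to a time-domain waveform.
spec = torch.polar(amplitude, phase)
waveform = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       win_length=win, window=torch.hann_window(win))
```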
ChatGPT in the context of precision agriculture data analytics
In this work we argue that the speech recognition input modality of ChatGPT provides a more intuitive and natural way for policy makers to interact with the database of an agricultural data-processing server that receives reports from a large, dispersed network of automated insect traps and sensor probes.
Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning
The approach involved fine-tuning a multi-speaker TTS model to work with child speech.
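As a rough illustration of this kind of transfer learning (a generic PyTorch sketch, not the paper's FastPitch code; the model and data loader are hypothetical placeholders):

```python
import torch

# Placeholders: `pretrained_tts` is a multi-speaker acoustic model loaded
# elsewhere; `child_speech_loader` yields (text_batch, mel_batch) pairs
# of child speech. Neither comes from the paper.
def finetune(pretrained_tts, child_speech_loader, epochs=10, lr=1e-4):
    # A small learning rate helps preserve what the multi-speaker model learned.
    optimizer = torch.optim.AdamW(pretrained_tts.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()  # L1 on mel-spectrograms is common in TTS

    pretrained_tts.train()
    for _ in range(epochs):
        for text_batch, mel_batch in child_speech_loader:
            optimizer.zero_grad()
            predicted_mel = pretrained_tts(text_batch)
            loss = loss_fn(predicted_mel, mel_batch)
            loss.backward()
            optimizer.step()
    return pretrained_tts
```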
ArTST: Arabic Text and Speech Transformer
We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language.