Voice Cloning

Voice cloning is a highly desired feature for personalized speech interfaces. A neural voice cloning system learns to synthesize a person’s voice from only a few audio samples.


Most implemented papers

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

PaddlePaddle/DeepSpeech 9 Jul 2019

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.

Neural Voice Cloning with a Few Samples

SforAiDl/Neural-Voice-Cloning-With-Few-Samples NeurIPS 2018

Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples.

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

PaddlePaddle/PaddleSpeech 7 Nov 2022

In this paper, we extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing.

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

Tomiinek/Multilingual_Text_to_Speech 3 Aug 2020

We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

ming024/FastSpeech2 6 Mar 2021

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples.

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

inconnu11/Objective-evaluation_speech_synthesis 22 Apr 2021

We achieve cross-lingual VC between Mandarin speech with multiple speakers and English speech with multiple speakers by applying bilingual bottleneck features.

Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text

tpulkit/txt2vid 26 Jun 2021

Video represents the majority of internet traffic today, driving a continual race between the generation of higher quality content, transmission of larger file sizes, and the development of network infrastructure.

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

ndkgit339/fastspeech2-filled_pause_speech_synthesis 14 Oct 2022

We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge.

Low-Resource Multilingual and Zero-Shot Multispeaker TTS

digitalphonetics/ims-toucan 21 Oct 2022

While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages.