Voice Cloning
27 papers with code • 0 benchmarks • 2 datasets
Voice cloning is a highly desired feature for personalized speech interfaces. Neural voice cloning system learns to synthesize a person’s voice from only a few audio samples.
Benchmarks
These leaderboards are used to track progress in Voice Cloning
Libraries
Use these libraries to find Voice Cloning models and implementationsMost implemented papers
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.
Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert
Keywords: Bark, ai voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's encodec, audio codebooks, semantic tokens, HuBert, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pretrained model checkpoints
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
Neural Voice Cloning with a Few Samples
Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples.
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech
In this paper, we extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing.
Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques
The growing use of voice user interfaces has led to a surge in the collection and storage of speech data.
WavLM model ensemble for audio deepfake detection
Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities.
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech
The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples.