Voice Conversion
114 papers with code • 1 benchmark • 2 datasets
Voice Conversion is a technology that modifies a source speaker's speech so that it sounds like the voice of a target speaker without changing the linguistic content.
Libraries
Use these libraries to find Voice Conversion models and implementations
Latest papers
TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
Existing methods do not satisfy both requirements of VC simultaneously, and their conversion outputs suffer from a trade-off between maintaining source content and target speaker characteristics.
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models.
Speaking Style Conversion With Discrete Self-Supervised Units
We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate the proposed approach is significantly superior to the evaluated baselines.
SpeechLMScore: Evaluating speech generation using speech language model
While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming.
Hiding speaker's sex in speech using zero-evidence speaker representation in an analysis/synthesis pipeline
The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes.
A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units
To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing the waveform from this information.
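The content/speaker decomposition described in that abstract can be sketched as follows. This is a minimal, illustrative pipeline with stand-in functions, not FreeVC's actual API: in practice the content encoder would be a self-supervised speech model, the speaker encoder a learned embedding network, and the reconstruction step a neural vocoder.

```python
import numpy as np

# Hypothetical encoders/decoder illustrating the two-stage VC idea:
# content from the source, speaker identity from the target,
# then reconstruction conditioned on both.

def extract_content(source_wave: np.ndarray) -> np.ndarray:
    # Stand-in: frame-level "content" features
    # (in real systems, e.g. features from a self-supervised model).
    return source_wave.reshape(-1, 10).mean(axis=1)

def extract_speaker(target_wave: np.ndarray) -> np.ndarray:
    # Stand-in: a single global speaker embedding
    # (in real systems, e.g. a d-vector or x-vector).
    return np.array([target_wave.mean(), target_wave.std()])

def reconstruct(content: np.ndarray, speaker: np.ndarray) -> np.ndarray:
    # Stand-in for a vocoder/decoder: upsample the content frames and
    # modulate them with the target speaker's statistics.
    return np.repeat(content, 10) * speaker[1] + speaker[0]

source = np.random.default_rng(0).standard_normal(100)
target = np.random.default_rng(1).standard_normal(100)
converted = reconstruct(extract_content(source), extract_speaker(target))
print(converted.shape)  # same length as the source utterance
```

The key design point the abstract alludes to is that the converted output keeps the source's content representation while all speaker-dependent conditioning comes from the target utterance.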
GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models
As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.
Voice Spoofing Countermeasures: Taxonomy, State-of-the-art, experimental analysis of generalizability, open challenges, and the way forward
We report the performance of these countermeasures on several datasets and evaluate them across corpora.
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed.