Voice Conversion

98 papers with code • 1 benchmark • 2 datasets

Voice conversion is a technique that modifies a source speaker's speech so that it sounds like that of a target speaker, without changing the linguistic content.

Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet



Most implemented papers

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

liusongxiang/StarGAN-Voice-Conversion 6 Jun 2018

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

jjery2243542/adaptive_voice_conversion 10 Apr 2019

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers.
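The paper's title names instance normalization as the mechanism for separating speaker from content. The following is a minimal sketch of the generic operation only, not the authors' implementation: `instance_norm` strips the per-channel mean and standard deviation of a single utterance, and `adaptive_instance_norm` re-applies statistics that would come from a target speaker. The function names, the `eps` value, and the idea of supplying `target_mean` and `target_std` directly are assumptions made for illustration.

```python
import statistics


def instance_norm(channel, eps=1e-5):
    """Normalize one feature channel of one utterance to zero mean, unit variance.

    `channel` is a list of values over time. Removing these per-utterance
    statistics is the step assumed here to discard global (speaker-like) cues.
    """
    mean = statistics.fmean(channel)
    var = statistics.pvariance(channel, mu=mean)
    std = (var + eps) ** 0.5
    return [(x - mean) / std for x in channel]


def adaptive_instance_norm(channel, target_mean, target_std, eps=1e-5):
    """Normalize the channel, then scale and shift it with target statistics.

    In a real model the target statistics would be predicted from a speaker
    representation; here they are passed in by hand (an assumption).
    """
    return [target_std * x + target_mean for x in instance_norm(channel, eps)]


if __name__ == "__main__":
    source_channel = [1.0, 2.0, 3.0]
    converted = adaptive_instance_norm(source_channel, target_mean=10.0, target_std=2.0)
    print(converted)
```

The temporal pattern of the channel is preserved, while its mean and spread are replaced by the target values.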

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

liusongxiang/StarGAN-Voice-Conversion 14 May 2019

On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

jackaduma/CycleGAN-VC2 30 Nov 2017

A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions with parallel and twice the amount of data.

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

lochenchou/MOSNet 17 Apr 2019

In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.

Unsupervised Speech Decomposition via Triple Information Bottleneck

auspicious3000/SpeechSplit ICML 2020

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

s3prl/s3prl 5 Jun 2020

To explore this issue, we proposed to employ Mockingjay, a self-supervised learning based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

jackaduma/CycleGAN-VC2 13 Oct 2016

We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

r9y9/gantts 23 Sep 2017

In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
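The weighted-sum objective described in this excerpt can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: plain mean squared error stands in for the conventional generation loss, the adversarial term is a standard negative log-likelihood on the discriminator's "natural" probability, and the names `acoustic_model_loss` and `adv_weight` are hypothetical.

```python
import math


def mean_squared_error(generated, natural):
    """Stand-in for the conventional generation loss (assumption: plain MSE)."""
    return sum((g - n) ** 2 for g, n in zip(generated, natural)) / len(natural)


def adversarial_loss(disc_prob_natural, eps=1e-12):
    """Loss for deceiving the discriminator.

    `disc_prob_natural` holds the discriminator's probability that each
    generated frame is natural; the loss is small when those are near 1.
    """
    return -sum(math.log(p + eps) for p in disc_prob_natural) / len(disc_prob_natural)


def acoustic_model_loss(generated, natural, disc_prob_natural, adv_weight=1.0):
    """Weighted sum of the generation loss and the adversarial loss."""
    return (mean_squared_error(generated, natural)
            + adv_weight * adversarial_loss(disc_prob_natural))


if __name__ == "__main__":
    loss = acoustic_model_loss(
        generated=[1.0, 2.0],
        natural=[1.0, 4.0],
        disc_prob_natural=[0.5, 0.5],
        adv_weight=2.0,
    )
    print(loss)
```

Setting `adv_weight` to zero recovers training on the generation loss alone, which makes the weight a convenient knob for comparing the two regimes.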

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

jjery2243542/voice_conversion 9 Apr 2018

The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
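One common way to feed a decoder both inputs named in this excerpt is to append the speaker embedding to every frame of the latent representation. The sketch below assumes that concatenation scheme, which the excerpt does not specify, and uses a hypothetical helper name `build_decoder_input` with made-up toy values.

```python
def build_decoder_input(content_latents, speaker_embedding):
    """Append the target speaker embedding to each frame of the content latent.

    `content_latents` is a list of per-frame vectors assumed to be
    speaker-independent; `speaker_embedding` is one vector for the target
    speaker. Concatenation is an assumption made for illustration.
    """
    return [list(frame) + list(speaker_embedding) for frame in content_latents]


if __name__ == "__main__":
    # Two frames of a toy 2-dimensional content representation.
    content = [[0.1, 0.2], [0.3, 0.4]]
    # Toy one-hot embeddings for two hypothetical target speakers.
    speaker_a = [1.0, 0.0]
    speaker_b = [0.0, 1.0]
    print(build_decoder_input(content, speaker_a))
    print(build_decoder_input(content, speaker_b))
```

The content part of each frame is identical across the two calls; only the appended speaker part differs, which is what lets a single decoder produce different target voices from the same source utterance.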