Voice Conversion

138 papers with code • 2 benchmarks • 5 datasets

Voice conversion is a technology that modifies a source speaker's speech so that it sounds like that of a target speaker, without changing the linguistic information.

Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet


Most implemented papers

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

kamepong/StarGAN-VC 6 Jun 2018

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

jjery2243542/adaptive_voice_conversion 10 Apr 2019

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers.

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

liusongxiang/StarGAN-Voice-Conversion 14 May 2019

On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

jackaduma/CycleGAN-VC2 30 Nov 2017

A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions, with parallel data and twice the amount of data.

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

lochenchou/MOSNet 17 Apr 2019

In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
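
For context, a mean opinion score (MOS) is the arithmetic mean of listener ratings, commonly collected on a 1 to 5 scale. The sketch below uses made-up ratings and hypothetical predictions to show how per-utterance MOS targets could be derived and how a predictor's output might be compared against them. The mean squared error is an assumed metric chosen for illustration, not necessarily the one the paper reports, and all names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def mos(ratings: Sequence[float]) -> float:
    """Mean opinion score: the arithmetic mean of listener ratings."""
    return float(mean(ratings))


def mean_squared_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average squared gap between predicted and human MOS (assumed metric)."""
    return float(mean([(p - r) ** 2 for p, r in zip(predicted, reference)]))


# Made-up listener ratings for two converted utterances (illustrative only).
listener_ratings: Dict[str, List[int]] = {
    "utt_a": [4, 5, 3, 4],
    "utt_b": [2, 1, 3, 2],
}

# Human MOS per utterance, in a fixed order.
human_mos = [mos(listener_ratings[name]) for name in ("utt_a", "utt_b")]

# Hypothetical outputs of some MOS predictor for the same utterances.
predicted_mos = [3.5, 2.5]

error = mean_squared_error(predicted_mos, human_mos)
print(human_mos, error)
```

A learned predictor would replace `predicted_mos` with model outputs computed from the audio itself; the comparison step stays the same.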

Unsupervised Speech Decomposition via Triple Information Bottleneck

auspicious3000/SpeechSplit ICML 2020

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.

Utilizing Self-supervised Representations for MOS Prediction

s3prl/s3prl 7 Apr 2021

In this paper, we use self-supervised pre-trained models for MOS prediction.

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

jackaduma/CycleGAN-VC2 9 Apr 2019

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data.
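
As an illustrative sketch (not taken from the CycleGAN-VC2 repository), the cycle-consistency idea that lets CycleGAN-style models train without parallel data can be written in a few lines of plain Python. The toy feature frames, the shift-based converters, and all function names are hypothetical. Real systems use neural networks as converters and combine this term with adversarial and other losses.

```python
from typing import Callable, List

Frames = List[List[float]]            # toy stand-in for acoustic feature frames
Converter = Callable[[Frames], Frames]


def l1_distance(a: Frames, b: Frames) -> float:
    """Mean absolute difference between two equally shaped frame sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += abs(x - y)
            count += 1
    return total / count


def cycle_consistency_loss(src: Frames, tgt: Frames,
                           to_target: Converter, to_source: Converter) -> float:
    """Penalize round trips that fail to reproduce the input.

    src and tgt never need to be aligned or to share content: each one is
    only ever compared with its own round-trip reconstruction.
    """
    forward = l1_distance(to_source(to_target(src)), src)
    backward = l1_distance(to_target(to_source(tgt)), tgt)
    return forward + backward


def make_shift(delta: float) -> Converter:
    """Hypothetical toy converter: adds a constant to every feature value."""
    return lambda frames: [[v + delta for v in frame] for frame in frames]


# Non-parallel toy data: different lengths, unrelated content.
source_frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target_frames = [[10.0, 11.0], [12.0, 13.0]]

perfect = cycle_consistency_loss(source_frames, target_frames,
                                 make_shift(1.0), make_shift(-1.0))
lossy = cycle_consistency_loss(source_frames, target_frames,
                               make_shift(1.0), make_shift(-0.5))
print(perfect, lossy)  # 0.0 1.0
```

When the two converters are exact inverses, the loss is zero. Any mismatch shows up as reconstruction error in both directions, which is what gives the model a training signal without paired utterances.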

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning 5 Jun 2020

To explore this issue, we proposed to employ Mockingjay, a self-supervised learning-based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

jackaduma/CycleGAN-VC2 13 Oct 2016

We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.