Voice Conversion

164 papers with code • 3 benchmarks • 6 datasets

Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker while leaving the linguistic content unchanged.

Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
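At a high level, most modern voice conversion systems disentangle linguistic content from speaker identity and then recombine the content with a target-speaker representation. A minimal sketch of that pipeline, with random projections standing in for trained networks (all shapes, names, and dimensions are illustrative, not taken from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def content_encoder(mel):
    # stand-in for a learned encoder: project 80-dim mel frames to a content code
    W = rng.standard_normal((80, 16))
    return mel @ W  # (frames, 16)

def decoder(content, speaker_embedding):
    # concatenate the target speaker embedding onto every frame, then project back
    spk = np.broadcast_to(speaker_embedding, (content.shape[0], speaker_embedding.shape[0]))
    W = rng.standard_normal((16 + 8, 80))
    return np.concatenate([content, spk], axis=1) @ W  # (frames, 80)

source_mel = rng.standard_normal((100, 80))   # mel spectrogram of the source utterance
target_spk = rng.standard_normal(8)           # embedding of the target speaker
converted = decoder(content_encoder(source_mel), target_spk)
print(converted.shape)  # (100, 80): same frame count, target speaker's identity
```

In a real system the converted spectrogram would then be passed to a vocoder (e.g. WaveNet, as in the source paper above) to synthesize a waveform.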


Most implemented papers

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

kamepong/StarGAN-VC 6 Jun 2018

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.
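The core idea can be sketched directly: a single generator G(x, c) is conditioned on a one-hot target-speaker code c, so one model covers every speaker pair, and a cycle-consistency loss lets it train without parallel utterances. A toy illustration with a linear generator (all shapes and helper names are assumptions for the sketch, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
num_speakers = 4

def generator(features, target_code, W=rng.standard_normal((36 + 4, 36))):
    # condition every frame on the one-hot target-speaker code
    code = np.broadcast_to(target_code, (features.shape[0], num_speakers))
    return np.concatenate([features, code], axis=1) @ W

x = rng.standard_normal((50, 36))   # e.g. 36-dim spectral-envelope frames of the source
c_tgt = np.eye(num_speakers)[2]     # convert to speaker 2
c_src = np.eye(num_speakers)[0]     # source is speaker 0

y = generator(x, c_tgt)             # forward conversion with the single shared generator
x_rec = generator(y, c_src)         # map back with the source code
cycle_loss = np.abs(x - x_rec).mean()  # cycle-consistency term used during training
print(y.shape, x_rec.shape)
```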

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

jjery2243542/adaptive_voice_conversion 10 Apr 2019

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers.
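The instance-normalization trick named in this paper's title can be illustrated in a few lines: normalizing each channel over time strips utterance-level (speaker-dependent) statistics while preserving the frame-to-frame variation that carries content. A minimal sketch (the synthetic "speaker bias" is an assumption for illustration):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (channels, time); remove per-channel mean/std computed over time
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(0)
content = rng.standard_normal((8, 120))
speaker_bias = rng.standard_normal((8, 1)) * 3   # global speaker-dependent offset
utterance = content + speaker_bias

normalized = instance_norm(utterance)
# after IN, per-channel means are ~0 regardless of the speaker offset
print(np.allclose(normalized.mean(axis=1), 0, atol=1e-6))  # prints True
```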

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

liusongxiang/StarGAN-Voice-Conversion 14 May 2019

CVAE (conditional variational autoencoder) training is simple but does not come with the distribution-matching property of a GAN.
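AUTOVC's key design is a carefully sized autoencoder bottleneck: the content code is downsampled until it is too narrow to smuggle speaker identity through, so the decoder must take identity from a separate speaker embedding instead. A toy sketch of the bottleneck (dimensions and the copy-upsampling are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

def bottleneck(codes, factor=8):
    # keep only every `factor`-th frame: information the narrow code cannot
    # carry (e.g. speaker identity) must come from the speaker embedding
    return codes[::factor]

def upsample(codes, factor=8):
    # naive copy-upsampling back to the original frame rate for the decoder
    return np.repeat(codes, factor, axis=0)

rng = np.random.default_rng(0)
content = rng.standard_normal((128, 32))   # encoder output: (frames, dim)
down = bottleneck(content)                 # (16, 32): the information bottleneck
up = upsample(down)                        # (128, 32): decoder input
print(down.shape, up.shape)
```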

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

jackaduma/CycleGAN-VC2 30 Nov 2017

A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model (GMM)-based method trained under advantageous conditions, i.e., with parallel data and twice the amount of training data.
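CycleGAN-VC trains two generators, source-to-target and target-to-source, and uses cycle-consistency and identity-mapping losses so no frame-aligned parallel corpus is needed. A toy numeric sketch with linear "generators" chosen as exact inverses, just to show how the two losses are formed (everything here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
W_xy = np.eye(24) + 0.1 * rng.standard_normal((24, 24))
W_yx = np.linalg.inv(W_xy)     # toy inverse so the cycle is (nearly) exact

g_xy = lambda x: x @ W_xy      # source -> target generator
g_yx = lambda y: y @ W_yx      # target -> source generator

x = rng.standard_normal((40, 24))   # source-speaker feature frames
y = rng.standard_normal((40, 24))   # target-speaker feature frames

# cycle-consistency loss: converting there and back should recover the input
cycle_loss = np.abs(g_yx(g_xy(x)) - x).mean()
# identity-mapping loss: target-domain input should pass through g_xy unchanged,
# which encourages the generator to preserve linguistic content
identity_loss = np.abs(g_xy(y) - y).mean()
print(cycle_loss < 1e-6)  # prints True for this toy invertible pair
```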

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

lochenchou/MOSNet 17 Apr 2019

In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
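MOSNet-style assessment models typically predict a quality score per frame and average the frame scores into an utterance-level MOS estimate. A hedged sketch of that aggregation, with a random linear regressor standing in for the trained CNN-BLSTM (the squashing to the 1–5 MOS range and all shapes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_scores(mel, W=0.01 * rng.standard_normal(80)):
    # stand-in for the trained frame-level regressor; squash into the 1-5 MOS range
    raw = mel @ W
    return 1 + 4 / (1 + np.exp(-raw))

mel = rng.standard_normal((200, 80))    # mel spectrogram of a converted utterance
per_frame = frame_scores(mel)           # one score per frame
utterance_mos = per_frame.mean()        # utterance-level score = mean of frame scores
print(1 <= utterance_mos <= 5)  # prints True
```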

Utilizing Self-supervised Representations for MOS Prediction

s3prl/s3prl 7 Apr 2021

In this paper, we use self-supervised pre-trained models for MOS prediction.

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

jackaduma/CycleGAN-VC2 9 Apr 2019

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data.

Unsupervised Speech Decomposition via Triple Information Bottleneck

auspicious3000/SpeechSplit ICML 2020

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
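The paper forces each of these components through its own bottleneck so they can be recombined independently. A toy sketch of routing separate per-frame codes (content, pitch, rhythm) plus a global timbre embedding into one decoder input (all shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 64
content = rng.standard_normal((T, 8))    # content-bottleneck output
pitch = rng.standard_normal((T, 2))      # pitch-contour-bottleneck output
rhythm = rng.standard_normal((T, 1))     # rhythm-bottleneck output
timbre = rng.standard_normal(4)          # one global speaker (timbre) embedding

# decoder input: per-frame codes plus the timbre embedding broadcast over time;
# swapping any single code converts only that aspect of the speech
dec_in = np.concatenate(
    [content, pitch, rhythm, np.broadcast_to(timbre, (T, 4))], axis=1)
print(dec_in.shape)  # (64, 15)
```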

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

s3prl/s3prl 5 Jun 2020

To explore this issue, we propose employing Mockingjay, a self-supervised learning-based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

microsoft/speecht5 ACL 2022

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning.