ClsVC: Learning Speech Representations with Two Different Classification Tasks

29 Sep 2021  ·  Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Voice conversion (VC) aims to transform speech from one speaker so that it sounds as if it were spoken by another speaker. Previous work learns latent representations by applying two separate encoders to extract content information and timbre information from the input speech. However, whether these approaches use a bottleneck network or vector quantization, it is very difficult to perfectly separate the speaker and content information in a speech signal. In this paper, we propose a novel voice conversion framework, 'ClsVC', to address this problem. It uses a single encoder to obtain both timbre and content information by dividing the latent space. In addition, we propose constraints to ensure that each part of the latent space contains only content or only timbre information. We show why these constraints are necessary, and we experimentally demonstrate that the content and timbre information remain well separated even when the division ratio of the latent space is changed. Experiments on the VCTK dataset show that ClsVC achieves state-of-the-art naturalness and similarity of the converted speech.
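
The abstract describes the mechanism only at a high level: a single encoder, a partitioned latent space, and constraints (per the title, two classification tasks) that keep the partitions disentangled. The PyTorch sketch below is an illustration of that idea, not the paper's implementation: the layer sizes, the timbre/content split ratio, and the use of a gradient-reversal speaker classifier for the content constraint are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class ClsVCSketch(nn.Module):
    # All hyperparameters here are illustrative guesses, not the paper's values.
    def __init__(self, n_mels=80, latent_dim=256, timbre_ratio=0.25, n_speakers=109):
        super().__init__()
        # One encoder produces a single latent vector per mel frame.
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        # Divide the latent space: the first chunk is timbre, the rest is content.
        self.timbre_dim = int(latent_dim * timbre_ratio)
        self.content_dim = latent_dim - self.timbre_dim
        # Two classification heads enforce the separation constraints.
        self.timbre_cls = nn.Linear(self.timbre_dim, n_speakers)
        self.content_cls = nn.Linear(self.content_dim, n_speakers)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, n_mels))

    def forward(self, mel, speaker_id):
        z = self.encoder(mel)
        z_timbre, z_content = z.split([self.timbre_dim, self.content_dim], dim=-1)
        # Constraint 1: the timbre partition must identify the speaker.
        loss_timbre = F.cross_entropy(self.timbre_cls(z_timbre), speaker_id)
        # Constraint 2: the content partition must NOT identify the speaker.
        # Gradient reversal trains the classifier to find speaker cues while
        # training the encoder to remove them from the content partition.
        loss_content = F.cross_entropy(
            self.content_cls(GradReverse.apply(z_content)), speaker_id)
        # Reconstruction keeps the latent representation informative.
        recon = self.decoder(torch.cat([z_timbre, z_content], dim=-1))
        loss_recon = F.l1_loss(recon, mel)
        return loss_recon + loss_timbre + loss_content

At conversion time, under this sketch, one would re-encode the source utterance and overwrite the timbre partition with the target speaker's before decoding.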

Datasets

VCTK