no code implementations • 5 Aug 2024 • Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang
StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
no code implementations • 12 Jun 2024 • Yue Li, Xinsheng Wang, Li Zhang, Lei Xie
Furthermore, a contrastive learning method is proposed to mitigate the overfitting tendencies in the training of both the fine-tuning-based method and SCDNet.
no code implementations • 19 Jan 2024 • Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang
Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech.
no code implementations • 3 Sep 2023 • Zhichao Wang, Xinsheng Wang, Qicong Xie, Tao Li, Lei Xie, Qiao Tian, Yuping Wang
In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation.
no code implementations • 16 Nov 2022 • Zhichao Wang, Xinsheng Wang, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang
Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC).
no code implementations • 23 Dec 2021 • Qicong Xie, Tao Li, Xinsheng Wang, Zhichao Wang, Lei Xie, Guoqiao Yu, Guanglu Wan
Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.
no code implementations • 9 Aug 2021 • Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Scharenborg
In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
1 code implementation • 23 Oct 2020 • Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes.
2 code implementations • 14 May 2020 • Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg
An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies.
no code implementations • 1 Feb 2020 • Xinsheng Wang, Shanmin Pang, Jihua Zhu
In the generalized zero-shot learning, synthesizing unseen data with generative models has been the most popular method to address the imbalance of training data between seen and unseen classes.
1 code implementation • 25 Dec 2019 • Ke Zhang, Yurong Guo, Xinsheng Wang, Dongliang Chang, Zhenbing Zhao, Zhanyu Ma, Tony X. Han
However, during the training of the deep convolutional neural network, the value of NLLR is not always positive or negative, which severely affects the convergence of NLLR.
no code implementations • 31 Jul 2019 • Ke Zhang, Xinsheng Wang, Yurong Guo, Zhenbing Zhao, Zhanyu Ma, Tony X. Han
A lot of studies of image classification based on deep convolutional neural network focus on the network structure to improve the image classification performance.
no code implementations • 30 Jun 2019 • Xinsheng Wang, Shanmin Pang, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Yaochen Li
The other is to optimize the visual feature structure in an intermediate embedding space, and in this method we successfully devise a multilayer perceptron framework based algorithm that is able to learn the common intermediate embedding space and meanwhile to make the visual data structure more distinctive.