1 code implementation • 19 Feb 2024 • Shengpeng Ji, Minghui Fang, Ziyue Jiang, Rongjie Huang, Jialong Zuo, Shulei Wang, Zhou Zhao
Furthermore, we also validate the efficiency of the Language-Codec on downstream speech language models.
no code implementations • 14 Feb 2024 • Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao
Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation.
no code implementations • 12 Feb 2024 • Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, YuanJun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou
By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can provide insights into the direction of future research.
no code implementations • 9 Feb 2024 • Wenyu Li, Yinuo Zhu, Xin Lin, Ming Li, Ziyue Jiang, Ziqian Zeng
Traditional discriminative approaches in mental health analysis are known for their strong capacity but lack interpretability and demand large-scale annotated data.
1 code implementation • 16 Jan 2024 • Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video.
1 code implementation • 21 Sep 2023 • Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
no code implementations • 28 Aug 2023 • Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
The dataset comprises 236,220 pairs of style prompts, written as natural text descriptions covering five style factors, and corresponding speech samples.
no code implementations • 14 Jul 2023 • Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage.
no code implementations • 6 Jun 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
We are interested in a novel task, namely low-resource text-to-talking avatar.
no code implementations • 6 Jun 2023 • Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies.
no code implementations • 30 May 2023 • Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
Various applications of voice synthesis have been developed independently, even though they all generate "voice" as output.
no code implementations • 1 May 2023 • Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao
Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
1 code implementation • 31 Jan 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao
Generating photo-realistic video portraits with arbitrary speech audio is a crucial problem in film-making and virtual reality.
1 code implementation • 5 Jun 2022 • Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language).
no code implementations • 14 Oct 2021 • Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao
Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech for multiple users from a few audio training samples stored locally on their devices.
no code implementations • 29 Sep 2021 • Ziyue Jiang, Yi Ren, Zhou Zhao
In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation.