Search Results for author: Jinzheng He

Found 12 papers, 5 papers with code

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

1 code implementation • 16 Jan 2024 • Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video.

3D Reconstruction • Super-Resolution • +1
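As a rough illustration of the two-stage pipeline the abstract above describes (reconstruct the avatar once, then animate it per frame), here is a minimal Python sketch. Every name and shape is a hypothetical stand-in, not the Real3D-Portrait API:

```python
import numpy as np

# Hypothetical stubs: reconstruct_avatar and animate are invented names,
# not functions from the Real3D-Portrait codebase.
def reconstruct_avatar(image: np.ndarray) -> dict:
    """Stage 1: infer a reusable 3D head representation from one unseen image."""
    return {"identity": image.mean(axis=(0, 1))}  # stand-in for learned geometry/appearance

def animate(avatar: dict, audio_feats: np.ndarray, n_frames: int = 4) -> list:
    """Stage 2: drive the fixed avatar with per-frame motion predicted from audio."""
    frames = []
    for t in range(n_frames):
        motion = audio_feats[t % len(audio_feats)]  # stand-in for an audio-to-motion model
        frames.append(avatar["identity"] * (1.0 + 0.01 * motion))
    return frames

source_image = np.random.rand(64, 64, 3)  # the single source portrait
video = animate(reconstruct_avatar(source_image), np.random.rand(16))
print(len(video), "frames rendered")
```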

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

no code implementations • 17 Dec 2023 • Yu Zhang, Rongjie Huang, RuiQi Li, Jinzheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

Moreover, existing SVS methods suffer a decline in the quality of synthesized singing voices in OOD scenarios, since they rest on the assumption that the target vocal attributes are discernible during training.

Quantization • Singing Voice Synthesis • +1

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

no code implementations • 14 Jul 2023 • Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

However, the prompting mechanisms of zero-shot TTS still face challenges: 1) previous zero-shot TTS systems are typically trained with single-sentence prompts, which significantly restricts their performance when more prompt data is available at inference time.

In-Context Learning • Language Modelling • +3

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

no code implementations • 24 May 2023 • Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.

Speech-to-Speech Translation • Translation

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

no code implementations • 22 May 2023 • Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao

To mitigate data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; and 2) leverage a diffusion transformer, scalable in parameters and capacity, to learn visual scene information.

Denoising • Self-Supervised Learning
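The abstract's second point, a scalable diffusion denoiser, can be pictured with a generic DDPM-style sampling loop. The schedule and the denoiser below are toy stand-ins under assumed settings, not the ViT-TTS model:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alphas = np.cumprod(1.0 - betas)     # cumulative alpha-bar values

def denoiser(x_t, t, cond):
    """Stand-in for the conditional denoiser (predicts the added noise)."""
    return 0.1 * x_t + 0.01 * cond   # placeholder, not a trained network

cond = rng.standard_normal(80)       # stand-in visual/text conditioning features
x = rng.standard_normal(80)          # start from pure noise (e.g., one mel frame)
for t in reversed(range(T)):
    eps = denoiser(x, t, cond)
    # DDPM-style posterior mean step (simplified; variance term omitted)
    x = (x - betas[t] / np.sqrt(1 - alphas[t]) * eps) / np.sqrt(1 - betas[t])
print("denoised mel frame:", x[:4])
```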

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

no code implementations • 21 May 2023 • Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao

Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries over relational databases. It has traditionally been implemented in a cascaded manner and faces the following challenges: 1) model training suffers from data scarcity, since limited parallel data is available; and 2) systems must be robust enough to handle diverse out-of-domain speech that differs from the source data.

SQL Parsing
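For contrast with the direct approach the title advertises, here is a toy sketch of the cascaded baseline the abstract describes (ASR followed by text-to-SQL), where recognition errors propagate into parsing. Both stages are hypothetical stubs:

```python
# Hypothetical cascaded pipeline; neither stage is Wav2SQL's actual code.
def asr(audio: bytes) -> str:
    """Stage 1: transcribe speech (errors here propagate downstream)."""
    return "show all singers from france"   # stand-in transcript

def text_to_sql(question: str, schema: dict) -> str:
    """Stage 2: parse the transcript into SQL against a known schema."""
    table = next(iter(schema))              # toy heuristic, not a real parser
    return f"SELECT * FROM {table} WHERE country = 'France';"

schema = {"singer": ["name", "country"]}
print(text_to_sql(asr(b"raw-audio"), schema))
```

A direct S2SQL model would instead map the audio to SQL in a single step, avoiding the intermediate transcript entirely.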

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

no code implementations • 18 May 2023 • Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao

To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconveniences.

Singing Voice Synthesis

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

no code implementations • 1 May 2023 • Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao

Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity, 3D-consistent talking face generation from only a few minutes of training video.

motion prediction • Talking Face Generation

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

1 code implementation • 31 Jan 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao

Generating a photo-realistic video portrait from arbitrary speech audio is a crucial problem in film-making and virtual reality.

Lip Reading • Talking Face Generation • +1

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

1 code implementation • 25 May 2022 • Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao

Specifically, a sequence of discrete representations derived in a self-supervised manner is predicted by the model and passed to a vocoder for speech reconstruction, while two challenges remain: 1) acoustic multimodality: discrete units derived from speech with the same content can be indeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; and 2) high latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism.

Representation Learning • Speech Synthesis • +2
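The latency point in the abstract, autoregressive versus parallel prediction of discrete units, can be illustrated with a toy comparison. The random "models" below are stand-ins, not TranSpeech itself:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 1000, 8  # assumed unit vocabulary size and output length

def ar_decode(source):
    """Autoregressive: one unit per step, each conditioned on the prefix,
    so generation takes LENGTH sequential forward passes."""
    units = []
    for _ in range(LENGTH):
        units.append(int(rng.integers(VOCAB)))  # stand-in for p(u_t | u_<t, source)
    return units

def nar_decode(source):
    """Non-autoregressive: all units in one pass, trading some accuracy
    for parallelism (hence lower latency)."""
    return rng.integers(VOCAB, size=LENGTH).tolist()  # stand-in for p(u | source)

print("AR units: ", ar_decode("source speech features"))
print("NAR units:", nar_decode("source speech features"))
# Either unit sequence would then be passed to a vocoder for waveform synthesis.
```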

PopMAG: Pop Music Accompaniment Generation

1 code implementation • 18 Aug 2020 • Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

To improve harmony, in this paper we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependencies among notes from different tracks.

Music Modeling
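To make the single-sequence idea concrete, the sketch below flattens two tracks onto a shared timeline so one autoregressive model could attend across tracks. The token layout is an invented toy, not the actual MuMIDI scheme:

```python
# Two toy tracks: (pitch, timeline position) pairs.
notes = {
    "melody": [("C4", 0), ("E4", 1)],
    "bass":   [("C2", 0), ("G2", 1)],
}

# Interleave all tracks' notes at each shared position, so cross-track
# dependencies appear inside one token sequence.
tokens = []
for pos in range(2):                     # shared timeline positions (e.g., beats)
    tokens.append(f"<pos_{pos}>")
    for track, events in notes.items():
        tokens.append(f"<track:{track}>")
        tokens += [pitch for pitch, p in events if p == pos]

print(tokens)
# ['<pos_0>', '<track:melody>', 'C4', '<track:bass>', 'C2',
#  '<pos_1>', '<track:melody>', 'E4', '<track:bass>', 'G2']
```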
