Search Results for author: Huadai Liu

Found 12 papers, 5 papers with code

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

1 code implementation13 Dec 2024 Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.

In-Context Learning Quantization +1

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

1 code implementation16 Oct 2024 Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Wei Xue, Zhou Zhao

Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment.

Audio Generation

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

no code implementations14 Oct 2024 Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions.

Audio Generation multimodal generation

MEDIC: Zero-shot Music Editing with Disentangled Inversion Control

no code implementations18 Jul 2024 Huadai Liu, Jialei Wang, Xiangtai Li, Rongjie Huang, Yang Liu, Jiayang Xu, Zhou Zhao

To counteract these issues, we introduce the Disentangled Inversion technique to disentangle the diffusion process into triple branches, rectifying the deviated path of the source branch caused by DDIM inversion.

Audio Generation

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

2 code implementations1 Jun 2024 Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver.

Audio Generation Audio Synthesis

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

no code implementations24 May 2023 Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.

Speech-to-Speech Translation Translation

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

no code implementations22 May 2023 Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao

To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leverage the diffusion transformer scalable in terms of parameters and capacity to learn visual scene information.

Decoder Denoising +3

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

no code implementations21 May 2023 Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao

Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases, which has been traditionally implemented in a cascaded manner while facing the following challenges: 1) model training is faced with the major issue of data scarcity, where limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data.

SQL Parsing

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

no code implementations18 May 2023 Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao

To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience.

Singing Voice Synthesis

AntCritic: Argument Mining for Free-Form and Visually-Rich Financial Comments

no code implementations20 Aug 2022 Huadai Liu, Wenqiang Xu, Xuan Lin, Jingjing Huo, Hong Chen, Zhou Zhao

Argument mining aims to detect all possible argumentative components and identify their relationships automatically.

Argument Mining Form +1

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

4 code implementations13 Jul 2022 Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren

Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling.

Denoising Knowledge Distillation +4

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

1 code implementation25 May 2022 Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao

Specifically, a sequence of discrete representations derived in a self-supervised manner are predicted from the model and passed to a vocoder for speech reconstruction, while still facing the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with same content could be indeterministic due to the acoustic property (e. g., rhythm, pitch, and energy), which causes deterioration of translation accuracy; 2) high latency: current S2ST systems utilize autoregressive models which predict each unit conditioned on the sequence previously generated, failing to take full advantage of parallelism.

Representation Learning Rhythm +4

Cannot find the paper you are looking for? You can Submit a new open access paper.