Search Results for author: Yongqi Wang

Found 15 papers, 3 papers with code

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

no code implementations • 19 Sep 2024 • Yongqi Wang, Shuo Yang, Xinxiao Wu, Jiebo Luo

To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework.

Tasks: Decoder, Object (+5 more)
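
As a rough illustration of the open-vocabulary relation-classification step this abstract describes, the sketch below scores one subject-object trajectory pair against text embeddings of candidate relation prompts. This is a minimal sketch assuming a CLIP-like encoder; the function, feature shapes, and temperature are hypothetical and not the paper's code.

```python
# Hypothetical sketch: open-vocabulary relation scoring for one trajectory pair.
import torch
import torch.nn.functional as F

def classify_relation(pair_feature: torch.Tensor,
                      relation_text_embeds: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """pair_feature:         (d,) pooled visual feature for a subject-object pair
    relation_text_embeds: (num_relations, d) text embeddings of relation prompts,
                          e.g. "a person riding a bicycle"
    Returns a distribution over the open relation vocabulary."""
    v = F.normalize(pair_feature, dim=-1)
    t = F.normalize(relation_text_embeds, dim=-1)
    logits = (t @ v) / temperature   # cosine similarity per relation prompt
    return logits.softmax(dim=-1)
```

Because the classifier is just similarity against text embeddings, new relation categories can be added at inference time by encoding new prompts, with no retraining.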

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

no code implementations • 4 Jun 2024 • RuiQi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity.

Tasks: In-Context Learning, Language Modeling (+4 more)
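
The two corruption strategies named in the abstract are concrete enough to sketch. Below is one plausible reading of them: resampling a discrete-unit sequence to a random length (destroying exact duration information) and perturbing an F0 contour by a random semitone shift. The rates and noise scale are assumptions, not the paper's settings.

```python
# Hypothetical sketch of discrete-unit random resampling and pitch corruption.
import random

def random_resample(units: list[int], low: float = 0.5, high: float = 1.5) -> list[int]:
    """Nearest-neighbor resample a unit sequence to a random length so the
    model cannot rely on exact durations."""
    rate = random.uniform(low, high)
    n = max(1, round(len(units) * rate))
    return [units[min(len(units) - 1, int(i * len(units) / n))] for i in range(n)]

def corrupt_pitch(f0: list[float], semitone_std: float = 1.0) -> list[float]:
    """Shift an F0 contour (Hz, 0.0 = unvoiced) by a random semitone offset."""
    shift = random.gauss(0.0, semitone_std)
    factor = 2.0 ** (shift / 12.0)
    return [p * factor if p > 0 else 0.0 for p in f0]
```

Corruptions like these let unpaired singing data stand in for paired speech-singing data, which is how the abstract frames the mitigation of data scarcity.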

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

1 code implementation • 1 Jun 2024 • Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, RuiQi Li, Zhou Zhao

By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video.

Tasks: Video-to-Sound Generation
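
Rectified flow matching, which this paper builds on, has a compact training objective: regress a constant velocity field along straight-line paths between noise and data. The sketch below shows that objective; the estimator is a stand-in argument, not the paper's feed-forward transformer, and the tensor shapes are assumptions.

```python
# Hypothetical sketch of a rectified-flow-matching training step.
import torch

def rectified_flow_loss(estimator, audio_latent: torch.Tensor, video_cond: torch.Tensor):
    """audio_latent: (B, C, T) target audio representation;
    video_cond: conditioning features from the video encoder."""
    x1 = audio_latent
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)    # uniform time in [0, 1)
    tb = t.view(-1, 1, 1)
    xt = (1 - tb) * x0 + tb * x1                    # straight-line interpolant
    target_v = x1 - x0                              # constant velocity along the path
    pred_v = estimator(xt, t, video_cond)           # non-autoregressive prediction
    return torch.mean((pred_v - target_v) ** 2)
```

Straight transport paths are what make the approach efficient: at inference, a few Euler steps (or even one) along the learned field can suffice.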

Robust Singing Voice Transcription Serves Synthesis

no code implementations • 16 May 2024 • RuiQi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao

Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications.

Tasks: Decoder, Singing Voice Synthesis
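
For concreteness, note-level transcription output of the kind the abstract describes can be pictured as a list of (pitch, onset, duration) records. The field layout below is illustrative, not the paper's format.

```python
# Hypothetical sketch of AST output consumed by an SVS pipeline.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int        # MIDI note number, e.g. 60 = C4
    onset: float      # seconds from the start of the recording
    duration: float   # seconds

# A transcribed phrase becomes a note sequence, ready to serve as
# automatic annotation for an SVS training corpus.
transcription = [Note(pitch=67, onset=0.00, duration=0.42),
                 Note(pitch=69, onset=0.42, duration=0.38)]
```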

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

1 code implementation • 18 Mar 2024 • Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, RuiQi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to explicitly control the style attributes of the synthesized singing.

Tasks: Attribute, Decoder (+1 more)

Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

no code implementations • 2 Mar 2024 • Shuo Yang, Zirui Shang, Yongqi Wang, Derong Deng, Hongwei Chen, Qiyuan Cheng, Xinxiao Wu

This paper proposes a novel framework for multi-label image recognition without any training data, called the data-free framework, which uses knowledge from a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM) such as CLIP to multi-label classification.

Tasks: Language Modeling (+2 more)
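
One way to picture the data-free idea in this abstract: label descriptions obtained from an LLM are encoded by a CLIP-like text encoder and matched against image features with independent per-label sigmoids (multi-label, so no softmax over classes). The function, bias, and temperature below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: per-label scoring from LLM-written label descriptions.
import torch
import torch.nn.functional as F

def multilabel_scores(image_feat: torch.Tensor,
                      label_desc_embeds: torch.Tensor,
                      temperature: float = 0.05,
                      bias: float = 0.2) -> torch.Tensor:
    """image_feat: (d,) CLIP image embedding.
    label_desc_embeds: (num_labels, d) embeddings of LLM-generated label
    descriptions, e.g. "a photo containing a dog, a four-legged pet".
    Returns independent per-label probabilities."""
    v = F.normalize(image_feat, dim=-1)
    t = F.normalize(label_desc_embeds, dim=-1)
    logits = (t @ v - bias) / temperature   # shifted/scaled cosine similarities
    return torch.sigmoid(logits)            # one sigmoid per label
```

In the paper's framing the prompts themselves are the learnable part; here they are frozen embeddings purely to keep the sketch short.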

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

no code implementations • 14 Sep 2023 • Yongqi Wang, Jionghao Bai, Rongjie Huang, RuiQi Li, Zhiqing Hong, Zhou Zhao

The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity.

Tasks: In-Context Learning, Language Modeling (+4 more)
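
In-context style transfer with a unit-based acoustic language model can be pictured as continuation: the model is prompted with style tokens from the target speaker followed by the source content tokens, and its continuation carries the content in the new style. The LM interface below (including `eos_id`) is hypothetical.

```python
# Hypothetical sketch of in-context style transfer with an acoustic LM.
import torch

@torch.no_grad()
def style_transfer(acoustic_lm, style_prompt_units, content_units, max_len=2000):
    """style_prompt_units, content_units: (B, T) discrete token tensors.
    Assumes acoustic_lm(tokens) returns logits of shape (B, T, vocab)."""
    context = torch.cat([style_prompt_units, content_units], dim=-1)
    generated = context
    for _ in range(max_len):
        logits = acoustic_lm(generated)[..., -1, :]           # next-token logits
        next_tok = logits.argmax(dim=-1, keepdim=True)        # greedy decode
        generated = torch.cat([generated, next_tok], dim=-1)
        if (next_tok == acoustic_lm.eos_id).all():            # hypothetical attr
            break
    return generated[..., context.size(-1):]   # keep only generated tokens
```

Because the style comes from the prompt rather than from supervision, no speaker-parallel data is needed, which is exactly the property the abstract emphasizes.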

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

no code implementations • 30 May 2023 • Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

Various voice synthesis applications have been developed independently, despite the fact that they all produce "voice" as output.

Tasks: Singing Voice Synthesis, Text to Speech (+1 more)

Connecting Multi-modal Contrastive Representations

no code implementations • NeurIPS 2023 • Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao

This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning multi-modal contrastive representations (MCR) without paired data.

Tasks: 3D Point Cloud Classification, Counterfactual (+4 more)
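
The bridging idea can be sketched as two small projectors that map two frozen contrastive spaces (say, CLIP image-text and CLAP audio-text) into a shared space, trained to agree on the overlapping modality (text). The projector shapes and loss below are illustrative assumptions, not C-MCR's actual objective.

```python
# Hypothetical sketch of connecting two contrastive spaces via their shared modality.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bridge(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, dim_shared: int = 512):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_shared)  # e.g. CLIP text -> shared
        self.proj_b = nn.Linear(dim_b, dim_shared)  # e.g. CLAP text -> shared

    def forward(self, text_emb_a: torch.Tensor, text_emb_b: torch.Tensor):
        """Embeddings of the *same* texts from both frozen spaces; aligning them
        makes the non-overlapping modalities (image, audio) comparable too."""
        za = F.normalize(self.proj_a(text_emb_a), dim=-1)
        zb = F.normalize(self.proj_b(text_emb_b), dim=-1)
        return 1 - (za * zb).sum(-1).mean()         # cosine alignment loss
```

No image-audio pairs are ever needed: text acts as the bridge, which is what makes the method "without paired data".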

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

1 code implementation • 8 Jul 2022 • Yongqi Wang, Zhou Zhao

To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and a relatively small model size.

Tasks: Lip to Speech Synthesis, Speech Synthesis
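
A non-autoregressive lip-to-speech pipeline in the spirit of this abstract predicts all acoustic frames in parallel instead of one at a time, which is where the low latency comes from. Every module in the skeleton below is a placeholder, not FastLTS's architecture.

```python
# Hypothetical skeleton: video frames -> visual encoder -> parallel decoder -> vocoder.
import torch
import torch.nn as nn

class NARLipToSpeech(nn.Module):
    def __init__(self, visual_encoder: nn.Module, decoder: nn.Module,
                 vocoder: nn.Module, upsample: int = 4):
        super().__init__()
        self.visual_encoder = visual_encoder  # lip-region video -> features
        self.decoder = decoder                # predicts all frames in parallel
        self.vocoder = vocoder                # mel-spectrogram -> waveform
        self.upsample = upsample              # video fps -> mel frame rate ratio

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.visual_encoder(video)                     # (B, Tv, d)
        feats = feats.repeat_interleave(self.upsample, dim=1)  # match audio length
        mel = self.decoder(feats)                              # (B, Ta, n_mels)
        return self.vocoder(mel)                               # (B, samples)
```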
