no code implementations • 19 Sep 2024 • Yongqi Wang, Shuo Yang, Xinxiao wu, Jiebo Luo
To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework.
no code implementations • 2 Jul 2024 • RuiQi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao
Text-to-song (TTSong) is a music generation task that synthesizes singing voices together with their accompaniment.
no code implementations • 4 Jun 2024 • RuiQi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao
We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity.
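The abstract names the two corruption strategies but not their details. The sketch below shows one plausible reading; the helper names (`random_resample_units`, `corrupt_pitch`) and parameter values are assumptions for illustration, not the paper's.

```python
import random

def random_resample_units(units, min_rate=0.8, max_rate=1.2):
    # Stretch or compress a discrete-unit sequence by duplicating or
    # dropping units at a random rate, simulating duration variation.
    rate = random.uniform(min_rate, max_rate)
    out, pos = [], 0.0
    while int(pos) < len(units):
        out.append(units[int(pos)])
        pos += rate
    return out

def corrupt_pitch(f0_hz, max_shift_semitones=2.0, noise_std=0.02):
    # Apply a random global pitch shift (in semitones) plus small
    # multiplicative jitter to an F0 contour.
    shift = random.uniform(-max_shift_semitones, max_shift_semitones)
    factor = 2.0 ** (shift / 12.0)
    return [f * factor * (1.0 + random.gauss(0.0, noise_std)) for f in f0_hz]
```

Corruptions like these force the model to reconstruct clean singing from degraded inputs, which is what lets unpaired singing data substitute for parallel pairs.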
1 code implementation • 1 Jun 2024 • Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, RuiQi Li, Zhou Zhao
By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video.
Ranked #4 on Video-to-Sound Generation on VGG-Sound
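As a rough illustration of channel-level cross-modal fusion, the sketch below concatenates temporally aligned video features with audio latents along the channel axis. The module name and dimensions are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    # Fuse per-frame video features with audio latents along the channel
    # dimension; assumes both streams are already at the same frame rate,
    # which is what gives the strong temporal alignment.
    def __init__(self, audio_dim=256, video_dim=512, hidden_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.fuse = nn.Linear(audio_dim + hidden_dim, audio_dim)

    def forward(self, audio_latent, video_feat):
        # audio_latent: (B, T, audio_dim); video_feat: (B, T, video_dim)
        v = self.video_proj(video_feat)            # match channel widths
        x = torch.cat([audio_latent, v], dim=-1)   # channel-level fusion
        return self.fuse(x)                        # back to audio width

fused = ChannelFusion()(torch.randn(2, 100, 256), torch.randn(2, 100, 512))
```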
no code implementations • 16 May 2024 • RuiQi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao
Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications.
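To make the task concrete: a note-level transcriber must turn frame-level pitch, voicing, and onset estimates into (onset time, offset time, pitch) events. The greedy decoder below is a generic illustration of that conversion, not the paper's method.

```python
import statistics

def frames_to_notes(f0_midi, voiced, onsets, hop_sec=0.01):
    # Start a note at each onset in a voiced region, end it when voicing
    # stops, and assign the median pitch over the segment.
    notes, start = [], None
    for i, v in enumerate(voiced):
        if v and (start is None or onsets[i]):
            if start is not None:  # close the previous note at this onset
                pitch = statistics.median(f0_midi[start:i])
                notes.append((start * hop_sec, i * hop_sec, round(pitch)))
            start = i
        elif not v and start is not None:  # voicing ended
            pitch = statistics.median(f0_midi[start:i])
            notes.append((start * hop_sec, i * hop_sec, round(pitch)))
            start = None
    if start is not None:  # close a note running to the last frame
        pitch = statistics.median(f0_midi[start:])
        notes.append((start * hop_sec, len(voiced) * hop_sec, round(pitch)))
    return notes
```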
no code implementations • 14 Apr 2024 • Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, RuiQi Li, Fuming You, Zhou Zhao, Zhimeng Zhang
A song is a combination of singing voice and accompaniment.
no code implementations • 20 Mar 2024 • Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu
Leveraging the synergy of audio and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings.
no code implementations • 19 Mar 2024 • Jun Yu, Gongpeng Zhao, Yongqi Wang, Zhihong Wei, Yang Zheng, Zerui Zhang, Zhongpeng Cai, Guochen Xie, Jichao Zhu, Wangyuan Zhu
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition.
no code implementations • 18 Mar 2024 • Jun Yu, Zhihong Wei, Zhongpeng Cai, Gongpeng Zhao, Zerui Zhang, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu
Facial Expression Recognition (FER) plays a crucial role in computer vision and finds extensive applications across various fields.
1 code implementation • 18 Mar 2024 • Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, RuiQi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly.
no code implementations • 2 Mar 2024 • Shuo Yang, Zirui Shang, Yongqi Wang, Derong Deng, Hongwei Chen, Qiyuan Cheng, Xinxiao wu
This paper proposes a novel framework for multi-label image recognition that requires no training data, called the data-free framework, which uses knowledge from a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM) such as CLIP to multi-label classification.
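A minimal stand-in for the zero-training setup: score each label independently against CLIP text prompts and apply a per-label threshold rather than a softmax over classes. The hand-written prompts and the 0.2 threshold below are placeholders for the LLM-derived prompts the paper actually learns.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

labels = ["person", "dog", "bicycle"]
prompts = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = torch.zeros(1, 3, 224, 224, device=device)  # stand-in for a real image

with torch.no_grad():
    t = model.encode_text(prompts)
    v = model.encode_image(image)
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    scores = (v @ t.T).squeeze(0)  # one independent similarity per label
    present = scores > 0.2         # per-label threshold, not an argmax
```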
no code implementations • 14 Sep 2023 • Yongqi Wang, Jionghao Bai, Rongjie Huang, RuiQi Li, Zhiqing Hong, Zhou Zhao
The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style-transfer ability without relying on any speaker-parallel data and thereby overcoming data scarcity.
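Schematically, in-context style transfer reduces to prompted continuation: prepend acoustic tokens from a style exemplar, then sample the rest. The loop below assumes a generic decoder-only `lm` mapping (B, T) token ids to (B, T, vocab) logits; it is a sketch of the idea, not the paper's decoding procedure.

```python
import torch

@torch.no_grad()
def continue_with_prompt(lm, prefix_ids, max_new_tokens=500, temperature=0.8):
    # prefix_ids packs the style exemplar's acoustic tokens followed by the
    # source content tokens; the LM picks up the style from the prompt alone,
    # with no fine-tuning and no speaker-parallel data.
    ids = prefix_ids
    for _ in range(max_new_tokens):
        logits = lm(ids)[:, -1, :] / temperature       # next-token logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids[:, prefix_ids.shape[1]:]                # newly generated tokens
```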
no code implementations • 30 May 2023 • Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
Various voice-synthesis applications have been developed independently, despite the fact that they all generate "voice" as output.
no code implementations • NeurIPS 2023 • Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao
This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning multi-modal contrastive representations (MCR) without paired data.
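One way to picture the idea: two pre-trained contrastive spaces (say, a CLIP-like image-text space and a CLAP-like audio-text space) both embed text, so small projectors can be trained to agree on the same captions, after which the non-overlapping modalities become comparable without any image-audio pairs. The toy loop below uses random tensors as stand-ins for real caption embeddings; dimensions, temperature, and the InfoNCE-style loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_a = nn.Linear(512, 256)   # from a CLIP-like text embedding space
proj_b = nn.Linear(512, 256)   # from a CLAP-like text embedding space
opt = torch.optim.Adam(list(proj_a.parameters()) + list(proj_b.parameters()),
                       lr=1e-4)

for _ in range(100):  # toy loop on random stand-in data
    ta = F.normalize(torch.randn(32, 512), dim=-1)  # text emb. of captions (space A)
    tb = F.normalize(torch.randn(32, 512), dim=-1)  # text emb. of same captions (space B)
    za = F.normalize(proj_a(ta), dim=-1)
    zb = F.normalize(proj_b(tb), dim=-1)
    logits = za @ zb.T / 0.07  # contrastively align the two spaces via text
    target = torch.arange(32)
    loss = F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```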
1 code implementation • 8 Jul 2022 • Yongqi Wang, Zhou Zhao
To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and a relatively small model size.
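The latency claim follows from the non-autoregressive design: all output frames are predicted in one parallel pass instead of one step per frame. The skeleton below illustrates that shape of computation with placeholder layer sizes; it is not FastLTS's actual architecture.

```python
import torch
import torch.nn as nn

class NARLipToSpeech(nn.Module):
    # Non-autoregressive lip-to-speech skeleton: the whole mel-spectrogram
    # is produced in a single forward pass, so latency does not grow with
    # the number of generated frames.
    def __init__(self, video_dim=512, mel_dim=80, upsample=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=video_dim, nhead=8,
                                       batch_first=True),
            num_layers=2)
        self.upsample = nn.Upsample(scale_factor=upsample)  # video fps -> mel rate
        self.to_mel = nn.Linear(video_dim, mel_dim)

    def forward(self, video_feat):                 # (B, T_video, video_dim)
        h = self.encoder(video_feat)
        h = self.upsample(h.transpose(1, 2)).transpose(1, 2)
        return self.to_mel(h)                      # (B, T_video * upsample, mel_dim)

mel = NARLipToSpeech()(torch.randn(2, 50, 512))
```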