no code implementations • 21 Apr 2025 • Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models.
no code implementations • 7 Mar 2025 • Guanghao Zhang, Tao Zhong, Yan Xia, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Mushui Liu, Dong She, Yi Wang, Hao Jiang
We construct interleaved multimodal multi-step reasoning chains that use critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals.
no code implementations • 4 Mar 2025 • Zhen Yang, Guibao Shen, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen
In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution image generation.
no code implementations • 10 Feb 2025 • D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation.
1 code implementation • 21 Nov 2024 • Jiacheng Ying, Mushui Liu, Zhe Wu, Runming Zhang, Zhu Yu, Siming Fu, Si-Yuan Cao, Chao Wu, Yunlong Yu, Hui-Liang Shen
RestorerID is a diffusion model-based method that restores low-quality images with varying levels of degradation by using a single reference image.
no code implementations • 6 Sep 2024 • Weijie He, Mushui Liu, Yunlong Yu, Zheming Lu, Xi Li
Single-frame infrared small target (SIRST) detection is highly challenging because minute targets must be discerned amid complex infrared background clutter.
no code implementations • 22 Aug 2024 • Bozheng Li, Mushui Liu, Gaoang Wang, Yunlong Yu
In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework to integrate both spatial information and sequential temporal dynamics into the feature embeddings.
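A minimal sketch of the general idea of a perceiver-style adapter for video features: a small set of learnable latent queries cross-attends over per-frame embeddings that carry temporal position information. The module names, sizes, and positional scheme are assumptions for illustration, not the paper's exact TSAM implementation.

```python
# Perceiver-style temporal adapter sketch (assumed names/sizes, not TSAM's code):
# learnable latents cross-attend over per-frame features so the output reflects
# both frame content and temporal order.
import torch
import torch.nn as nn

class SequentialPerceiverAdapter(nn.Module):
    def __init__(self, dim=512, num_latents=16, num_heads=8, max_frames=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.time_pos = nn.Parameter(torch.randn(max_frames, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim) per-frame embeddings from a frozen backbone
        b, t, _ = frame_feats.shape
        x = frame_feats + self.time_pos[:t]           # inject temporal order
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.cross_attn(q, x, x)             # latents attend over the frame sequence
        return self.proj(out.mean(dim=1))             # (batch, dim) clip-level embedding

feats = torch.randn(2, 8, 512)                        # 2 clips, 8 frames each
print(SequentialPerceiverAdapter()(feats).shape)      # torch.Size([2, 512])
```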
no code implementations • 22 Aug 2024 • Mushui Liu, Fangtai Wu, Bozheng Li, Ziqian Lu, Yunlong Yu, Xi Li
Few-shot learning (FSL) aims to recognize new concepts using a limited number of visual samples.
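For concreteness, here is a worked 5-way 1-shot episode in the standard FSL protocol, scored with a generic nearest-prototype (class-mean) rule. The episode layout is the common convention; the classifier is a baseline used only for illustration, not this paper's method.

```python
# Illustrative 5-way 1-shot episode with a nearest-prototype classifier.
import torch

n_way, k_shot, n_query, dim = 5, 1, 15, 512
support = torch.randn(n_way, k_shot, dim)             # labelled examples of the new classes
query = torch.randn(n_way * n_query, dim)             # unlabelled examples to classify

prototypes = support.mean(dim=1)                       # one prototype per class: (5, 512)
dists = torch.cdist(query, prototypes)                 # distance of each query to each prototype
pred = dists.argmin(dim=1)                             # nearest prototype = predicted class
print(pred.shape)                                      # torch.Size([75])
```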
1 code implementation • 12 Aug 2024 • Mushui Liu, Bozheng Li, Yunlong Yu
In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features.
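As a rough illustration of aggregating frozen per-frame CLIP features at several temporal scales, the sketch below pools frame embeddings with different window sizes and concatenates the results. The window sizes and simple average pooling are placeholders; OmniCLIP's actual spatial-temporal modules are not reproduced here.

```python
# Multi-scale temporal aggregation sketch over per-frame CLIP embeddings
# (illustrative only; not OmniCLIP's actual modules).
import torch
import torch.nn.functional as F

def multi_scale_pool(frame_feats, scales=(1, 2, 4)):
    # frame_feats: (batch, num_frames, dim) per-frame CLIP embeddings
    x = frame_feats.transpose(1, 2)                    # (batch, dim, num_frames)
    pooled = [F.avg_pool1d(x, kernel_size=s, stride=s).mean(dim=-1) for s in scales]
    return torch.cat(pooled, dim=-1)                   # (batch, dim * len(scales))

feats = torch.randn(2, 8, 512)
print(multi_scale_pool(feats).shape)                   # torch.Size([2, 1536])
```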
1 code implementation • 10 Jul 2024 • Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, Leilei Gan, Hao Jiang
Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis.
no code implementations • 4 Jul 2024 • Mushui Liu, Bozheng Li, Yunlong Yu
Prompt tuning, which involves training a small set of parameters, effectively adapts pre-trained Vision-Language Models (VLMs) to downstream tasks.
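A minimal sketch of what "training a small set of parameters" looks like in the CoOp style of prompt tuning: a handful of learnable context vectors are prepended to frozen class-name token embeddings and passed through a frozen text encoder, so only the context vectors receive gradients. The tiny encoder below is a stand-in, not CLIP's actual text transformer.

```python
# CoOp-style prompt tuning sketch: only `ctx` is trainable.
import torch
import torch.nn as nn

dim, n_ctx, num_classes, n_name_tok = 512, 16, 10, 4

frozen_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
for p in frozen_encoder.parameters():
    p.requires_grad_(False)                            # text encoder stays frozen

class_name_embeds = torch.randn(num_classes, n_name_tok, dim)  # frozen class-name tokens
ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)             # the only trainable parameters

def class_text_features():
    prompts = torch.cat([ctx.unsqueeze(0).expand(num_classes, -1, -1),
                         class_name_embeds], dim=1)    # learned context + class name tokens
    return frozen_encoder(prompts).mean(dim=1)         # pooled per-class text feature

image_feat = torch.randn(2, dim)                       # features from a frozen image encoder
logits = image_feat @ class_text_features().t()        # (2, num_classes) similarity logits
print(logits.shape)
```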
no code implementations • 30 Jun 2024 • Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan
Diffusion models have exhibited substantial success in text-to-image generation.
1 code implementation • 17 May 2024 • Mushui Liu, Jun Dan, Ziqian Lu, Yunlong Yu, Yingming Li, Xi Li
In this paper, we propose CM-UNet, comprising a CNN-based encoder for extracting local image features and a Mamba-based decoder for aggregating and integrating global information, facilitating efficient semantic segmentation of remote sensing images.
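A structural sketch of the encoder-decoder layout described above: a small CNN encoder extracts local features, a global-mixing decoder aggregates context and upsamples, and a skip connection fuses the two before the segmentation head. The GlobalBlock here is a simple gated-convolution stand-in; the real CM-UNet decoder uses Mamba state-space blocks, which are not reproduced in this illustration.

```python
# CNN-encoder / global-mixing-decoder sketch with a skip connection
# (GlobalBlock is a stand-in for a Mamba block, for illustration only).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU())

class GlobalBlock(nn.Module):                          # stand-in for a Mamba block
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 7, padding=3, groups=c)
        self.gate = nn.Conv2d(c, c, 1)
    def forward(self, x):
        return x + self.dw(x) * torch.sigmoid(self.gate(x))

class TinySegNet(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.Sequential(GlobalBlock(64), nn.Upsample(scale_factor=2, mode="bilinear"))
        self.fuse = conv_block(64 + 32, 32)            # fuse decoder output with encoder skip
        self.head = nn.Conv2d(32, num_classes, 1)
    def forward(self, x):
        s1 = self.enc1(x)                              # local CNN features, full resolution
        s2 = self.enc2(self.pool(s1))                  # deeper features, half resolution
        up = self.dec(s2)                              # global mixing + upsample
        return self.head(self.fuse(torch.cat([up, s1], dim=1)))

print(TinySegNet()(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 6, 128, 128])
```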
1 code implementation • 6 Dec 2023 • Mushui Liu, Weijie He, Ziqian Lu, Yunlong Yu
Prompt learning is a powerful technique for transferring Vision-Language Models (VLMs) such as CLIP to downstream tasks.