3 code implementations • 1 Jun 2025 • Zhengcong Fei, Hao Jiang, Di Qiu, Baoxuan Gu, Youqiang Zhang, Jiahua Wang, Jialin Bai, Debang Li, Mingyuan Fan, Guibin Chen, Yahui Zhou
The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored.
1 code implementation • 17 Apr 2025 • Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: motion dynamics are compromised to enhance temporal visual quality; video duration is constrained (5-10 seconds) to prioritize resolution; and shot-aware generation remains inadequate because general-purpose MLLMs cannot interpret cinematic grammar, such as shot composition, actor expressions, and camera motions.
1 code implementation • 3 Apr 2025 • Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, Yahui Zhou
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element.
1 code implementation • 3 Jan 2025 • Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
This paper presents a powerful framework, referred to as Ingredients, that customizes video creations with video diffusion Transformers by incorporating multiple specific identity (ID) photos.
1 code implementation • 14 Dec 2024 • Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen
This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation.
1 code implementation • 16 Jul 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while offering highly optimized inference.
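The core idea behind a sparse Mixture-of-Experts layer, routing each token to a small top-k subset of expert feed-forward networks, can be sketched as follows. This is a minimal NumPy illustration: the fixed random router, the function name `topk_moe_ffn`, and its parameters are hypothetical stand-ins for DiT-MoE's learned components, not the actual implementation.

```python
import numpy as np

def topk_moe_ffn(tokens, expert_weights, k=2, seed=0):
    """Route each token to its top-k experts and mix their outputs.

    tokens: (n_tokens, d) array; expert_weights: list of (d, d) matrices.
    A fixed random projection stands in for the learned router.
    """
    n_experts = len(expert_weights)
    rng = np.random.default_rng(seed)
    router = rng.standard_normal((tokens.shape[1], n_experts))
    logits = tokens @ router                        # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]       # top-k expert indices per token
    sel = np.take_along_axis(logits, topk, axis=1)  # their router logits
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates = gates / gates.sum(axis=1, keepdims=True)  # softmax over selected experts
    out = np.zeros_like(tokens, dtype=float)
    for t in range(tokens.shape[0]):
        for j in range(k):
            e = topk[t, j]
            out[t] += gates[t, j] * (tokens[t] @ expert_weights[e])
    return out
```

Only k of the experts run per token, which is what makes inference cheaper than a dense layer of the same total parameter count.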
no code implementations • 3 Jun 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang
This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements.
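A hybrid Transformer/Mamba architecture interleaves attention blocks with state-space (SSM) blocks. The toy sketch below, with hypothetical helpers `attention`, `ssm_scan`, and `hybrid_stack`, illustrates only the interleaving pattern; real selective SSMs use learned, input-dependent parameters, and Dimba's actual layer layout is not reproduced here.

```python
import numpy as np

def attention(x):
    """Toy single-head self-attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)   # row-wise softmax
    return w @ x

def ssm_scan(x, a=0.9):
    """Simplified linear state-space recurrence: h_t = a * h_{t-1} + x_t."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def hybrid_stack(x, depth=4):
    """Alternate attention and SSM blocks with residual connections."""
    for i in range(depth):
        block = attention if i % 2 == 0 else ssm_scan
        x = x + block(x)
    return x
```

The attractive trade-off such hybrids target: attention gives global token mixing, while the SSM scan is linear in sequence length.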
1 code implementation • 6 Apr 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
Transformers have catalyzed advancements in computer vision and natural language processing (NLP) fields.
1 code implementation • CVPR 2020 • Debang Li, Junge Zhang, Kaiqi Huang, Ming-Hsuan Yang
However, the mutual relations between the candidates from an image play an essential role in composing a good shot due to the comparative nature of this problem.
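One way to picture "mutual relations between candidates" is to rank each candidate crop by pairwise comparisons against the others rather than by its score in isolation. The sketch below (`rank_crops_by_mutual_comparison`, a hypothetical helper) counts pairwise wins over fixed scores; the paper instead learns these comparative relations end-to-end.

```python
def rank_crops_by_mutual_comparison(scores):
    """Rank candidate crops by how many other candidates they beat.

    scores: list of per-candidate aesthetic scores.
    Returns candidate indices, best first, reflecting the comparative
    nature of crop selection.
    """
    wins = [sum(s > t for t in scores) for s in scores]
    return sorted(range(len(scores)), key=lambda i: -wins[i])
```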
1 code implementation • CVPR 2020 • Debang Li, Junge Zhang, Kaiqi Huang
In addition, both the intermediate and final results show that the proposed model can predict different cropping windows for an image depending on different aspect ratio requirements.
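Predicting a different cropping window per aspect ratio can be emulated by a brute-force search over windows of the requested ratio on a per-pixel aesthetic score map. The function below (`best_crop_for_ratio`, hypothetical) is a toy stand-in for the paper's learned, ratio-conditioned predictor.

```python
import numpy as np

def best_crop_for_ratio(score_map, ratio, step=1):
    """Return (x, y, w, h) of the window with aspect ratio w/h == ratio
    that maximizes the mean of a per-pixel aesthetic score map."""
    H, W = score_map.shape
    best, best_box = -np.inf, None
    for h in range(step, H + 1, step):
        w = int(round(h * ratio))
        if w < 1 or w > W:
            continue
        for y in range(0, H - h + 1, step):
            for x in range(0, W - w + 1, step):
                s = score_map[y:y + h, x:x + w].mean()
                if s > best:
                    best, best_box = s, (x, y, w, h)
    return best_box
```

Different `ratio` arguments select different windows over the same map, mirroring the paper's observation that the best crop depends on the required aspect ratio.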
3 code implementations • CVPR 2018 • Debang Li, Huikai Wu, Junge Zhang, Kaiqi Huang
Image cropping aims at improving the aesthetic quality of images by adjusting their composition.