no code implementations • 17 Apr 2025 • Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: motion dynamics are compromised to enhance temporal visual quality, video duration is constrained (5-10 seconds) to prioritize resolution, and shot-aware generation remains inadequate because general-purpose MLLMs cannot interpret cinematic grammar such as shot composition, actor expressions, and camera motions.
1 code implementation • 3 Apr 2025 • Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, Yahui Zhou
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element.
1 code implementation • 15 Feb 2025 • Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, Xiang Wen
We present SkyReels-A1, a simple yet effective framework built upon video diffusion Transformer to facilitate portrait image animation.
1 code implementation • 3 Jan 2025 • Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
This paper presents a powerful framework, referred to as Ingredients, for customizing video creations with video diffusion Transformers by incorporating multiple specific identity (ID) photos.
1 code implementation • 14 Dec 2024 • Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen
This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation.
2 code implementations • 1 Sep 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang
This paper explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation, termed FluxMusic.
Ranked #2 on Text-to-Music Generation on MusicCaps
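For readers unfamiliar with rectified flow, the training objective is compact: interpolate linearly between noise and data and regress the constant velocity along that line. Below is a minimal PyTorch sketch of one such training step; `model`, its call signature, and the latent shapes are illustrative assumptions, not FluxMusic's actual code.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1, cond):
    """One rectified-flow training step (sketch).

    x1:   clean latents (e.g., music latents), shape (B, ...)
    cond: conditioning such as text embeddings; `model(x, t, cond)` is hypothetical.
    """
    b = x1.shape[0]
    t = torch.rand(b, *([1] * (x1.dim() - 1)), device=x1.device)  # uniform times
    x0 = torch.randn_like(x1)              # noise endpoint
    xt = (1.0 - t) * x0 + t * x1           # straight-line interpolation
    v_target = x1 - x0                     # constant velocity along the line
    v_pred = model(xt, t.flatten(), cond)  # network regresses the velocity
    return F.mse_loss(v_pred, v_target)
```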
1 code implementation • 16 Jul 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is scalable and competitive with dense networks while exhibiting highly optimized inference.
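A sparse diffusion Transformer of this kind replaces the dense feed-forward block with a routed mixture of experts, so each token activates only a few experts. The following is an illustrative top-k MoE layer in PyTorch, assuming a simple linear router; it sketches the general technique rather than DiT-MoE's exact architecture or load-balancing losses.

```python
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Top-k routed mixture-of-experts FFN (illustrative sketch)."""
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.softmax(-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```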
no code implementations • 3 Jun 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang
This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements.
no code implementations • 20 Apr 2024 • Zhengcong Fei, Mingyuan Fan, Junshi Huang
Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps.
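Consistency models map any noisy input directly back to a clean sample, which is what enables sampling in a handful of steps. Here is a hedged sketch of multi-step consistency sampling, where `f(x, sigma)` stands for an assumed trained consistency function and the noise schedule is hypothetical.

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigmas=(80.0, 24.0, 5.8, 0.5),
                       sigma_min=0.002, device="cpu"):
    """Multi-step consistency sampling (sketch; `f` maps (x, sigma) -> clean sample)."""
    x = torch.randn(shape, device=device) * sigmas[0]
    x = f(x, sigmas[0])                        # one-step jump toward the data manifold
    for sigma in sigmas[1:]:
        z = torch.randn_like(x)
        x_noisy = x + (sigma**2 - sigma_min**2) ** 0.5 * z  # re-noise to level sigma
        x = f(x_noisy, sigma)                  # denoise again at the lower level
    return x
```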
1 code implementation • 6 Apr 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
Transformers have catalyzed advancements in computer vision and natural language processing (NLP) fields.
2 code implementations • 8 Feb 2024 • Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang
We endeavor to train diffusion models for image data in which the traditional U-Net backbone is supplanted by a state space backbone operating on raw patches or latent representations.
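To convey the idea of swapping the U-Net for a sequence model, the sketch below patchifies image latents into a token sequence and runs a heavily simplified diagonal linear recurrence in place of attention. Real state space blocks (e.g., Mamba-style) use selective, input-dependent state updates and a parallel scan; this toy version only illustrates the substitution.

```python
import torch
import torch.nn as nn

class TinySSMBlock(nn.Module):
    """A heavily simplified diagonal linear recurrence standing in for an SSM block."""
    def __init__(self, dim):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(dim))  # per-channel state decay
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, L, D) patch tokens
        decay = torch.sigmoid(self.log_decay)  # in (0, 1)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.shape[1]):            # naive sequential scan over patches
            h = decay * h + (1 - decay) * u[:, t]
            states.append(h)
        return x + self.out_proj(torch.stack(states, dim=1))

def patchify(latents, p=2):
    """(B, C, H, W) -> (B, L, C*p*p) sequence of non-overlapping patches."""
    B, C, H, W = latents.shape
    x = latents.unfold(2, p, p).unfold(3, p, p)            # (B, C, H//p, W//p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)
```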
no code implementations • 22 Dec 2023 • Xiaoyue Duan, Shuhao Cui, Guoliang Kang, Baochang Zhang, Zhengcong Fei, Mingyuan Fan, Junshi Huang
Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes.
no code implementations • 27 Nov 2023 • Zhengcong Fei, Mingyuan Fan, Junshi Huang
The target representations of those regions are extracted by the exponential moving average of the context encoder, i.e., the target encoder, over the whole spectrogram.
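The exponential-moving-average relationship between the context and target encoders described here can be stated in a few lines. A minimal sketch, assuming both encoders share the same architecture; the momentum value is illustrative (such schemes typically anneal it toward 1.0).

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Keep the target encoder as an EMA of the context encoder's weights."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)  # p_t <- m*p_t + (1-m)*p_c
```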
no code implementations • 10 Sep 2023 • Guisheng Liu, Yi Li, Zhengcong Fei, Haiyan Fu, Xiangyang Luo, Yanqing Guo
While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-world application of these systems.
1 code implementation • 7 Aug 2023 • Yuchen Ma, Zhengcong Fei, Junshi Huang
The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens.
no code implementations • 12 Apr 2023 • Zhengcong Fei, Mingyuan Fan, Junshi Huang
Recent works on personalized text-to-image generation usually learn to bind a special token with specific subjects or styles of a few given images by tuning its embedding through gradient descent.
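The mechanics described here, binding a special token by tuning only its embedding with gradient descent, look roughly like the loop below. `diffusion_loss` is a hypothetical closure standing in for the frozen diffusion model's denoising loss with the token spliced into the prompt; names and defaults are assumptions.

```python
import torch

def tune_token_embedding(diffusion_loss, images, dim=768, steps=500, lr=5e-3):
    """Bind a new special token to a few subject images by tuning only its embedding."""
    token_emb = torch.randn(dim, requires_grad=True)   # the only trainable parameter
    opt = torch.optim.AdamW([token_emb], lr=lr)
    for _ in range(steps):
        loss = diffusion_loss(images, token_emb)       # frozen model, trainable token
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token_emb
```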
1 code implementation • CVPR 2023 • Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei
In this paper, we introduce a novel Generative Adversarial Network-like framework, referred to as GAN-MAE, in which a generator produces the masked patches from the remaining visible patches and a discriminator predicts whether each patch was synthesized by the generator.
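One adversarial masked-pretraining step in this style can be sketched as follows: the generator fills masked patches, and the discriminator scores every patch of the corrupted sequence as real or generated. This is a simplified illustration of the setup, not GAN-MAE's exact losses or weight-sharing details.

```python
import torch
import torch.nn.functional as F

def gan_mae_step(generator, discriminator, patches, mask):
    """One adversarial masked-pretraining step (sketch).

    patches: (B, L, D) patch embeddings; mask: (B, L) bool, True = masked.
    `generator` and `discriminator` are assumed modules; the discriminator
    returns one real/fake logit per patch.
    """
    fake = generator(patches, mask)                          # fill masked positions
    mixed = torch.where(mask.unsqueeze(-1), fake, patches)   # corrupted sequence
    d_logits = discriminator(mixed.detach())                 # (B, L) per-patch scores
    d_loss = F.binary_cross_entropy_with_logits(d_logits, (~mask).float())
    g_logits = discriminator(mixed)                          # grads flow to generator
    g_loss = F.binary_cross_entropy_with_logits(
        g_logits[mask], torch.ones_like(g_logits[mask])      # fool the discriminator
    )
    return d_loss, g_loss
```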
no code implementations • 30 Nov 2022 • Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei
It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated context information is required to determine it.
no code implementations • 5 Oct 2022 • Zhengcong Fei, Shuman Tian, Junshi Huang, Xiaoming Wei, Xiaolin Wei
Knowledge distillation allows a single model to efficiently capture the approximate performance of an ensemble, but it scales poorly because the student must be re-trained whenever new teacher models are introduced.
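A standard way to distill an ensemble is to match the student to the teachers' averaged softened distribution, which also makes the scalability problem concrete: adding a teacher changes the target, forcing a re-train. A minimal sketch with a temperature-scaled KL loss; this is the common Hinton-style recipe, not necessarily this paper's formulation.

```python
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, teacher_logits_list, T=2.0):
    """KL between the student and the averaged, temperature-softened teacher distribution."""
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(0)                                             # average over teachers
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * T * T
```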
no code implementations • 5 Oct 2022 • Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang
Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by uniformly predicting discrete image tokens from the top left to the bottom right of the latent space.
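Raster-order VQ-AR decoding, the baseline this sentence describes, looks like the loop below: every token gets one forward pass and the same compute, regardless of its visual importance. `model`, `bos_id`, and the grid size are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_vq_ar(model, text_emb, h=16, w=16, bos_id=0, temperature=1.0):
    """Raster-scan sampling of discrete image tokens (sketch; `model` is hypothetical)."""
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)   # start-of-image token
    for _ in range(h * w):                                  # top-left to bottom-right
        logits = model(tokens, text_emb)[:, -1]             # next-position logits
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)                   # sample one codebook index
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:].view(1, h, w)                      # grid of VQ code indices
```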
1 code implementation • Findings (ACL) 2022 • Zhexin Zhang, Yeshuang Zhu, Zhengcong Fei, Jinchao Zhang, Jie Zhou
With the increasing popularity of online chatting, stickers are becoming important in our online communication.
1 code implementation • 22 Jul 2022 • Zhengcong Fei, Junshi Huang, Xiaoming Wei, Xiaolin Wei
Existing approaches to image captioning usually generate the sentence word by word from left to right, constrained to condition only on local context, i.e., the given image and previously generated words.
1 code implementation • CVPR 2022 • Zhengcong Fei, Xu Yan, Shuhui Wang, Qi Tian
On one hand, the representation in shallow layers lacks the high-level semantics and sufficient cross-modal fusion information needed for accurate prediction.
no code implementations • 19 Nov 2021 • Xu Yan, Zhengcong Fei, Shuhui Wang, Qingming Huang, Qi Tian
Dense video captioning (DVC) aims to generate multi-sentence descriptions to elucidate the multiple events in the video, which is challenging and demands visual consistency, discoursal coherence, and linguistic diversity.
1 code implementation • 11 Oct 2021 • Xu Yan, Zhengcong Fei, Zekang Li, Shuhui Wang, Qingming Huang, Qi Tian
Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence in sentence generation, can achieve performance comparable to autoregressive counterparts with considerable acceleration.
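Iterative non-autoregressive decoding of this kind typically follows a mask-predict pattern: predict all positions in parallel, then repeatedly re-mask and re-predict the least confident tokens. The sketch below assumes a hypothetical `model(tokens, image_feats)` returning per-position vocabulary logits; the linear re-masking schedule is illustrative.

```python
import torch

@torch.no_grad()
def mask_predict_decode(model, image_feats, length, n_iters=5, mask_id=0):
    """Mask-predict style iterative refinement (sketch; details vary across papers)."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long)  # start fully masked
    for it in range(n_iters):
        logits = model(tokens, image_feats)                 # (1, length, vocab)
        probs, preds = logits.softmax(-1).max(-1)           # confidence and argmax
        tokens = preds
        n_mask = int(length * (1 - (it + 1) / n_iters))     # linearly fewer re-masks
        if n_mask == 0:
            break
        low_conf = probs.topk(n_mask, largest=False).indices
        tokens[0, low_conf[0]] = mask_id                    # re-mask least confident
    return tokens
```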
1 code implementation • 4 Sep 2021 • Zhengcong Fei, Zekang Li, Jinchao Zhang, Yang Feng, Jie Zhou
Compared to previous dialogue tasks, MOD is much more challenging since it requires the model to understand the multimodal elements as well as the emotions behind them.
1 code implementation • Findings (ACL) 2021 • Zekang Li, Jinchao Zhang, Zhengcong Fei, Yang Feng, Jie Zhou
Employing human judges to interact with chatbots in order to assess their capabilities is costly, inefficient, and difficult to free from subjective bias.
1 code implementation • ACL 2021 • Zekang Li, Jinchao Zhang, Zhengcong Fei, Yang Feng, Jie Zhou
Nowadays, open-domain dialogue models built on large-scale pre-trained language models can generate acceptable responses according to the historical context.