1 code implementation • 5 May 2025 • Inclusion AI, Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language.
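A toy PyTorch sketch of the general idea behind a native multimodal autoregressive model: text and image tokens share one vocabulary and one causal transformer predicts the next token regardless of modality. This is a hedged illustration with made-up sizes, not Ming-Lite-Uni's actual architecture:

```python
import torch
import torch.nn as nn

# Toy unified sequence: text and image token ids live in one shared table,
# so a single causal transformer can model interleaved vision-language data.
vocab_text, vocab_image, dim = 1000, 8192, 256        # illustrative sizes
embed = nn.Embedding(vocab_text + vocab_image, dim)   # one table for both modalities
decoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
head = nn.Linear(dim, vocab_text + vocab_image)

tokens = torch.randint(0, vocab_text + vocab_image, (1, 20))  # interleaved ids
n = tokens.shape[1]
causal = torch.triu(torch.ones(n, n, dtype=torch.bool), 1)    # block future positions
logits = head(decoder(embed(tokens), src_mask=causal))        # next-token logits
```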
no code implementations • 10 Mar 2025 • Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships.
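For readers unfamiliar with LoRA, the sketch below shows the mechanism a "relation LoRA triplet" could build on: a frozen linear layer plus independent low-rank adapters, one per role. The role names, rank, and sizes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as a no-op on the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# One adapter per role; a hybrid mask could decide which adapter each token trains.
base = nn.Linear(64, 64)
triplet = {role: LoRALinear(base) for role in ("subject", "object", "relation")}
y = triplet["relation"](torch.randn(2, 64))
```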
1 code implementation • 11 Dec 2024 • Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha
Generating detailed captions that comprehend text-rich visual content in images has received growing attention in Large Vision-Language Models (LVLMs).
no code implementations • 10 Dec 2024 • Shuailei Ma, Kecheng Zheng, Ying Wei, Wei Wu, Fan Lu, Yifei Zhang, Chen-Wei Xie, Biao Gong, Jiapeng Zhu, Yujun Shen
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive.
no code implementations • 8 Dec 2024 • Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng
We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator.
no code implementations • 4 Dec 2024 • Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang
Text serves as the key control signal in video generation due to its narrative nature.
1 code implementation • 26 Nov 2024 • Chengxing Xie, Xiaoming Zhang, Linze Li, Yuqian Fu, Biao Gong, Tianrui Li, Kai Zhang
In line with this insight, we introduce Multi-Range Attention Transformer (MAT) tailored for SR tasks.
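A minimal sketch of the multi-range idea, under the assumption that "ranges" correspond to self-attention windows of different sizes; the real MAT operates on 2D feature maps and differs in detail:

```python
import torch

def window_attention_1d(x, window):
    """Self-attention restricted to non-overlapping windows of a given size."""
    b, n, d = x.shape
    assert n % window == 0
    xw = x.view(b, n // window, window, d)
    attn = torch.softmax(xw @ xw.transpose(-1, -2) / d ** 0.5, dim=-1)
    return (attn @ xw).view(b, n, d)

def multi_range_attention(x, windows=(4, 8, 16)):
    """Fuse attention computed at several ranges (window sizes)."""
    return torch.stack([window_attention_1d(x, w) for w in windows]).mean(0)

x = torch.randn(1, 16, 32)                # (batch, tokens, channels)
y = multi_range_attention(x)              # same shape, mixed short/long range
```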
no code implementations • 30 Oct 2024 • Yuxin Zhang, Dandan Zheng, Biao Gong, Jingdong Chen, Ming Yang, Weiming Dong, Changsheng Xu
Lighting plays a pivotal role in ensuring the naturalness of video generation, significantly influencing the aesthetic quality of the generated content.
no code implementations • 24 Oct 2024 • Wen Wang, Qiuyu Wang, Kecheng Zheng, Hao Ouyang, Zhekai Chen, Biao Gong, Hao Chen, Yujun Shen, Chunhua Shen
We propose Framer for interactive frame interpolation, which aims to produce smoothly transitioning frames between two images according to the user's creative intent.
no code implementations • 14 Oct 2024 • Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, Ming Yang
Our in-depth analysis attributes this limitation to insufficient modeling of motion: these methods cannot comprehend the movement pattern of the driving video and therefore impose the pose sequence rigidly onto the target character.
1 code implementation • 4 Sep 2024 • Wen Li, Muyuan Fang, Cheng Zou, Biao Gong, Ruobing Zheng, Meng Wang, Jingdong Chen, Ming Yang
To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer.
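A hedged sketch of the general pattern such a method could follow: project a style-image embedding into a few pseudo text tokens and concatenate them with the prompt embeddings. Token count and dimensions are assumptions, not StyleTokenizer's actual design:

```python
import torch
import torch.nn as nn

class StyleTokenizerSketch(nn.Module):
    """Map a style embedding into a few tokens in the text embedding space."""
    def __init__(self, style_dim=768, text_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(style_dim, num_tokens * text_dim)
        self.num_tokens, self.text_dim = num_tokens, text_dim

    def forward(self, style_emb):                      # (b, style_dim)
        return self.proj(style_emb).view(-1, self.num_tokens, self.text_dim)

style_emb = torch.randn(2, 768)                        # e.g. from a style encoder
style_tokens = StyleTokenizerSketch()(style_emb)       # (2, 4, 768)
text_tokens = torch.randn(2, 77, 768)                  # prompt embeddings
conditioning = torch.cat([text_tokens, style_tokens], dim=1)
```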
no code implementations • 30 Aug 2024 • Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang
To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly explore the optimal classification branches for identifying the individual elements, which has popularized the three-branch architecture.
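For concreteness, a minimal PyTorch sketch of such a three-branch architecture: separate heads recognize the state, the object, and their composition from shared features (the vocabulary sizes are illustrative):

```python
import torch
import torch.nn as nn

class ThreeBranchCZSL(nn.Module):
    """Separate classification heads for state, object, and composition."""
    def __init__(self, feat_dim, n_states, n_objects):
        super().__init__()
        self.state_head = nn.Linear(feat_dim, n_states)
        self.object_head = nn.Linear(feat_dim, n_objects)
        self.comp_head = nn.Linear(feat_dim, n_states * n_objects)

    def forward(self, feats):
        return (self.state_head(feats),
                self.object_head(feats),
                self.comp_head(feats))

feats = torch.randn(8, 512)               # image features from a backbone
state_logits, object_logits, comp_logits = ThreeBranchCZSL(512, 115, 245)(feats)
```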
1 code implementation • 26 Aug 2024 • Chaochao Chen, Jiaming Zhang, Yizhao Zhang, Li Zhang, Lingjuan Lyu, Yuyuan Li, Biao Gong, Chenggang Yan
Specifically, we examine the deeper influence of unlearning on recommendation fairness and on robustness to data with varying impact levels.
no code implementations • 22 Jul 2024 • Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs).
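The sketch below illustrates one plausible ingredient of such a bridge: pooling visual features at multiple scales into a token sequence whose length can be kept small during pre-training and grown afterwards. It is an assumption-laden toy, not Chain-of-Sight's actual module:

```python
import torch
import torch.nn.functional as F

def multi_scale_visual_tokens(feat_map, scales=(1, 2, 4)):
    """Pool a feature map to several grid sizes and flatten into tokens."""
    tokens = [F.adaptive_avg_pool2d(feat_map, s).flatten(2).transpose(1, 2)
              for s in scales]                       # each: (b, s*s, c)
    return torch.cat(tokens, dim=1)                  # (b, sum of s*s, c)

feat = torch.randn(2, 1024, 16, 16)                  # vision encoder feature map
pretrain_tokens = multi_scale_visual_tokens(feat, scales=(1, 2))     # 5 tokens
finetune_tokens = multi_scale_visual_tokens(feat, scales=(1, 2, 4))  # 21 tokens
```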
no code implementations • 24 Jun 2024 • Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, Yinqiang Zheng
Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns.
no code implementations • CVPR 2024 • Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu
To align the generated image with layout instructions, we present SimM, a training-free layout calibration system that intervenes in the generative process on the fly during inference.
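A hedged sketch of what an on-the-fly calibration of this kind could look like: up-weight a text token's cross-attention inside its target box during denoising, then renormalize. The function and parameters are illustrative, not SimM's actual intervention:

```python
import torch

def calibrate_attention(attn, token_idx, box, grid, boost=2.0):
    """Up-weight one text token's cross-attention inside its layout box.
    attn: (heads, grid*grid, n_text_tokens); box: (x0, y0, x1, y1) in [0, 1]."""
    mask = torch.zeros(grid, grid)
    x0, y0, x1, y1 = (int(v * grid) for v in box)
    mask[y0:y1, x0:x1] = 1.0
    attn = attn.clone()
    attn[:, :, token_idx] *= 1.0 + (boost - 1.0) * mask.flatten()
    return attn / attn.sum(-1, keepdim=True)     # renormalize over text tokens

attn = torch.rand(8, 16 * 16, 77)                # toy cross-attention map
attn = calibrate_attention(attn, token_idx=5, box=(0.0, 0.0, 0.5, 0.5), grid=16)
```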
1 code implementation • CVPR 2024 • Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang
Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach.
Ranked #7 on Text-to-Video Generation on MSR-VTT
1 code implementation • CVPR 2024 • Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou
Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those involving quantity, object-attribute binding, and multi-subject descriptions.
no code implementations • CVPR 2024 • Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang
Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from context features, including appearance.
no code implementations • 6 Jun 2023 • Xiaoying Xie, Biao Gong, Yiliang Lv, Zhen Han, Guoshuai Zhao, Xueming Qian
Most recent works focus on answering first-order logical queries to explore knowledge graph reasoning via multi-hop logical prediction.
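As a toy example of answering a multi-hop first-order query in embedding space (a GQE-style relation projection used purely for illustration; the entities and relations are made up):

```python
import torch

# Answer ("alan_turing", born_in, ?x) AND (?x, located_in, ?y) by chaining
# relation projections, then ranking candidate entities by dot product.
emb = {"alan_turing": torch.randn(16)}
relations = {"born_in": torch.randn(16, 16), "located_in": torch.randn(16, 16)}
entities = torch.randn(100, 16)                  # candidate entity embeddings

def project(q, relation):
    """One hop: push the query vector through a relation transform."""
    return q @ relations[relation]

q = project(project(emb["alan_turing"], "born_in"), "located_in")
scores = entities @ q                            # rank candidate answers
answer = scores.argmax().item()
```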
no code implementations • 20 Apr 2023 • Yuyuan Li, Chaochao Chen, Xiaolin Zheng, Yizhao Zhang, Biao Gong, Jun Wang
In this paper, we first identify two main disadvantages of directly applying existing unlearning methods in the context of recommendation, i.e., (i) unsatisfactory efficiency for large-scale recommendation models and (ii) destruction of collaboration across users and items.
1 code implementation • CVPR 2024 • Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs.
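A minimal sketch in this spirit, assuming a shared learnable context plus per-state and per-object embeddings concatenated into one prompt per composition (sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

n_states, n_objects, ctx_len, dim = 10, 20, 4, 512
ctx = nn.Parameter(torch.randn(ctx_len, dim))    # shared learnable context
state_emb = nn.Embedding(n_states, dim)
object_emb = nn.Embedding(n_objects, dim)

def pair_prompt(state_id, object_id):
    """Prompt tokens for one state-object composition, e.g. 'wet dog'."""
    s = state_emb(torch.tensor([state_id]))
    o = object_emb(torch.tensor([object_id]))
    return torch.cat([ctx, s, o], dim=0)         # (ctx_len + 2, dim)

prompt = pair_prompt(3, 7)                       # fed to a frozen text encoder
```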
1 code implementation • ICCV 2023 • Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao
Video temporal grounding aims to pinpoint a video segment that matches the query description.
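A brute-force baseline makes the task concrete: score every candidate (start, end) segment by its mean clip-query similarity and return the best span. This sketch is for illustration only, not the paper's method:

```python
import torch
import torch.nn.functional as F

def ground_query(clip_feats, query_feat, max_len=8):
    """Return the (start, end) clip span best matching the query."""
    n = clip_feats.shape[0]
    sims = F.cosine_similarity(clip_feats, query_feat[None], dim=1)
    best, span = -1.0, (0, 0)
    for s in range(n):
        for e in range(s + 1, min(s + max_len, n) + 1):
            score = sims[s:e].mean().item()
            if score > best:
                best, span = score, (s, e)
    return span, best

clip_feats = torch.randn(32, 512)   # per-clip features from a video encoder
query_feat = torch.randn(512)       # sentence feature from a text encoder
span, score = ground_query(clip_feats, query_feat)
```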
no code implementations • ICCV 2023 • Yutong Feng, Biao Gong, Jianwen Jiang, Yiliang Lv, Yujun Shen, Deli Zhao, Jingren Zhou
ViM consists of a zoo of lightweight plug-in modules, each of which is independently learned on a midstream dataset with a shared frozen backbone.
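A hedged sketch of the plug-in pattern: lightweight residual adapters, each trained on one midstream dataset, swapped in on top of a shared frozen backbone (the module design and dataset names are assumptions):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck module learned per midstream dataset."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)                  # residual keeps the backbone signal

backbone = nn.Linear(512, 512)                  # stand-in for a frozen backbone
for p in backbone.parameters():
    p.requires_grad = False

zoo = {name: Adapter(512) for name in ("food", "fashion", "vehicles")}
x = torch.randn(4, 512)
features = zoo["fashion"](backbone(x))          # pick the module matching the task
```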
no code implementations • 14 Feb 2023 • Biao Gong, Shuai Tan, Yutong Feng, Xiaoying Xie, Yuyuan Li, Chaochao Chen, Kecheng Zheng, Yujun Shen, Deli Zhao
This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data.
1 code implementation • CVPR 2023 • Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang
Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only incurs a huge computational burden with many more parameters but also causes the forgetting of knowledge from upstream models.
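For contrast, the module-free baseline is simple: mean-pool frozen CLIP frame features and compare with the text feature. The sketch below shows that baseline, not the paper's lightweight tuning method:

```python
import torch

def video_text_similarity(frame_embs, text_emb):
    """Cosine similarity between pooled frame embeddings and a text embedding."""
    video_emb = frame_embs.mean(dim=0)
    video_emb = video_emb / video_emb.norm()
    text_emb = text_emb / text_emb.norm()
    return (video_emb @ text_emb).item()

frame_embs = torch.randn(12, 512)   # CLIP image features for sampled frames
text_emb = torch.randn(512)         # CLIP text feature for the query
score = video_text_similarity(frame_embs, text_emb)
```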
no code implementations • 1 Feb 2020 • Chenggang Yan, Biao Gong, Yuxuan Wei, Yue Gao
Therefore, we introduce multi-view deep neural networks into the hash learning field and design an efficient retrieval model that achieves a significant improvement in retrieval performance.
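A minimal sketch of multi-view deep hashing, assuming per-view projection heads fused into relaxed binary codes (dimensions and the fusion rule are illustrative):

```python
import torch
import torch.nn as nn

class MultiViewHash(nn.Module):
    """Fuse several feature views and map them to hash codes."""
    def __init__(self, view_dims=(512, 256), bits=64):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, bits) for d in view_dims)

    def forward(self, views):
        fused = torch.stack([h(v) for h, v in zip(self.heads, views)]).mean(0)
        return torch.tanh(fused)                 # relaxed codes for training

views = [torch.randn(8, 512), torch.randn(8, 256)]   # two views of 8 items
codes = torch.sign(MultiViewHash()(views))           # binarize for retrieval
```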