no code implementations • 10 Mar 2025 • Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships.
no code implementations • 12 Dec 2024 • Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions.
no code implementations • 28 Nov 2024 • Feng Liu, Shiwei Zhang, XiaoFeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising.
no code implementations • 26 Nov 2024 • Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai
The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamic and semantic following after the identity injection.
no code implementations • 13 Nov 2024 • XiaoFeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments.
no code implementations • 17 Oct 2024 • Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan
In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning.
1 code implementation • 9 Oct 2024 • Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, YuChao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou
Our experiments with extensive data indicate that the model trained on generated data of the advanced model can approximate its generation capability.
no code implementations • 30 Sep 2024 • Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang
Text-to-video diffusion models have made remarkable advancements.
no code implementations • 18 Aug 2024 • Dongze Li, Kang Zhao, Wei Wang, Yifeng Ma, Bo Peng, Yingya Zhang, Jing Dong
Talking head synthesis is a practical technique with wide applications.
no code implementations • 3 Jun 2024 • Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang
First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model.
1 code implementation • CVPR 2024 • Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang
Following such a pipeline, we study the effect of doubling the scale of training set (i. e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9. 67 to 8. 19 and FVD from 484 to 441), demonstrating the scalability of our approach.
Ranked #7 on
Text-to-Video Generation
on MSR-VTT
1 code implementation • CVPR 2024 • Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
no code implementations • 18 Dec 2023 • Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, Tieniu Tan
Audio-driven talking head synthesis is a promising topic with wide applications in digital human, film making and virtual reality.
1 code implementation • 15 Dec 2023 • Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng
To more conveniently specify personalized emotions, a diffusion-based style predictor is utilized to predict the personalized emotion directly from the audio, eliminating the need for extra emotion reference.
2 code implementations • 14 Dec 2023 • Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, Nong Sang
Consistency models have demonstrated powerful capability in efficient image generation and allowed synthesis within a few sampling steps, alleviating the high computational cost in diffusion models.
1 code implementation • CVPR 2024 • Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan
In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern.
1 code implementation • CVPR 2024 • Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang
At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser.
Ranked #6 on
Text-to-Video Generation
on MSR-VTT
3 code implementations • 7 Nov 2023 • Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
no code implementations • 16 Oct 2023 • Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text.
1 code implementation • 12 Oct 2023 • Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong
Based on DeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model during the training phase, and predicts the latent space directions from the CLIP textual feature differences during the inference phase.
1 code implementation • ICCV 2023 • Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
When pre-training on the large-scale Kinetics-710, we achieve 89. 7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST.
3 code implementations • ICCV 2023 • Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao
In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data.
Ranked #1 on
Zero-Shot Human-Object Interaction Detection
on HICO-DET
(using extra training data)
5 code implementations • 12 Aug 2023 • Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i. e., Stable Diffusion).
Ranked #9 on
Text-to-Video Generation
on MSR-VTT
1 code implementation • 10 Aug 2023 • Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Yingya Zhang, Ziwei Liu, Marcelo H. Ang Jr
Spatial convolutions are extensively used in numerous deep video models.
Ranked #4 on
Action Recognition
on EPIC-KITCHENS-100
(using extra training data)
1 code implementation • 9 Jul 2023 • Jun Cen, Shiwei Zhang, Yixuan Pei, Kun Li, Hang Zheng, Maochun Luo, Yingya Zhang, Qifeng Chen
In this way, RGB images are not required during inference anymore since the 2D knowledge branch provides 2D information according to the 3D LIDAR input.
no code implementations • 27 Jun 2023 • Tianxiang Ma, Kang Zhao, Jianxin Sun, Yingya Zhang, Jing Dong
Efficiently generating a freestyle 3D portrait with high quality and 3D-consistency is a promising yet challenging task.
4 code implementations • NeurIPS 2023 • Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis.
Ranked #5 on
Text-to-Video Generation
on EvalCrafter Text-to-Video (ECTV) Dataset
(using extra training data)
1 code implementation • CVPR 2023 • Xiang Wang, Shiwei Zhang, Zhiwu Qing, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang
To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder.
1 code implementation • CVPR 2023 • Jun Cen, Shiwei Zhang, Xiang Wang, Yixuan Pei, Zhiwu Qing, Yingya Zhang, Qifeng Chen
In this paper, we begin with analyzing the feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck (IB) theory, and propose to enlarge the instance-specific (IS) and class-specific (CS) information contained in the feature for better performance.
1 code implementation • CVPR 2023 • Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution.
Ranked #11 on
Video Generation
on UCF-101
1 code implementation • 6 Mar 2023 • Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang
Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task.
1 code implementation • 8 Feb 2023 • Jun Cen, Di Luan, Shiwei Zhang, Yixuan Pei, Yingya Zhang, Deli Zhao, Shaojie Shen, Qifeng Chen
Recently, Unified Open-set Recognition (UOSR) has been proposed to reject not only unknown samples but also known but wrongly classified samples, which tends to be more practical in real-world applications.
no code implementations • ICCV 2023 • Yixuan Pei, Zhiwu Qing, Shiwei Zhang, Xiang Wang, Yingya Zhang, Deli Zhao, Xueming Qian
In this paper, we will fill this gap by learning multiple prompts based on a powerful image-language pre-trained model, i. e., CLIP, making it fit for video class-incremental learning (VCIL).
no code implementations • CVPR 2023 • Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
Generating a talking face video from the input audio sequence is a practical yet challenging task.
no code implementations • 14 Oct 2022 • Kun Yuan, Xinmeng Huang, Yiming Chen, Xiaohan Zhang, Yingya Zhang, Pan Pan
While (Lu and Sa, 2021) have recently provided an optimal rate for non-convex stochastic decentralized optimization with weight matrices defined over linear graphs, the optimal rate with general weight matrices remains unclear.
no code implementations • 29 Sep 2021 • Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Yingya Zhang, Pan Pan, Wotao Yin
Decentralized adaptive gradient methods, in which each node averages only with its neighbors, are critical to save communication and wall-clock training time in deep learning tasks.
no code implementations • CVPR 2021 • Liuyihan Song, Kang Zhao, Pan Pan, Yu Liu, Yingya Zhang, Yinghui Xu, Rong Jin
Different from all of them, we regard large and small gradients selection as the exploitation and exploration of gradient information, respectively.
no code implementations • 19 May 2021 • Yiming Chen, Kun Yuan, Yingya Zhang, Pan Pan, Yinghui Xu, Wotao Yin
Communication overhead hinders the scalability of large-scale distributed training.
1 code implementation • ICCV 2021 • Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, Wotao Yin
Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM promises both efficient and high-quality training.
no code implementations • 9 Feb 2021 • Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, Rong Jin
We hope visual search at Alibaba becomes more widely incorporated into today's commercial applications.
no code implementations • 9 Feb 2021 • Kang Zhao, Pan Pan, Yun Zheng, Yanhao Zhang, Changxu Wang, Yingya Zhang, Yinghui Xu, Rong Jin
For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods.
no code implementations • 9 Feb 2021 • Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu
Researches have demonstrated that low bit-width (e. g., INT8) quantization can be employed to accelerate the inference process.
no code implementations • 9 Feb 2021 • Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, Rong Jin
In the last decades, extreme classification has become an essential topic for deep learning.