1 code implementation • 22 Aug 2024 • Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, WeiHao Wang, Kevin Qinghong Lin, YuChao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation.
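A minimal sketch of the core idea behind unified models of this kind: one transformer operating over a shared vocabulary of text and image tokens. All sizes and names here are illustrative assumptions; the sketch omits Show-o's actual training objectives and released architecture.

```python
# Hypothetical sketch: a single transformer over a shared text+image
# token vocabulary. Sizes are illustrative, not Show-o's configuration.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 32000, 8192, 512  # assumed sizes

class UnifiedTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # one embedding table covers both modalities: image codes are
        # offset past the text vocabulary
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) of mixed-modality ids
        return self.head(self.blocks(self.embed(tokens)))

model = UnifiedTransformer()
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 64))
logits = model(torch.cat([text, image], dim=1))  # (1, 80, 40192)
```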
1 code implementation • 12 Mar 2024 • Weijia Wu, Zhuang Li, YuChao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation.
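One simple way to form an "entity representation" is masked average pooling of latent features over the object's region, paired with a target trajectory as the control signal. The pooling choice below is an assumption for illustration, not the exact DragAnything formulation.

```python
# Hedged sketch: masked pooling yields a per-entity vector; the pair
# (entity vector, trajectory) conditions a video diffusion model.
import torch

def entity_representation(latent_feats, mask):
    """latent_feats: (C, H, W) latent features; mask: (H, W) binary entity mask."""
    mask = mask.float()
    pooled = (latent_feats * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)
    return pooled  # (C,) vector describing the selected entity

feats = torch.randn(320, 32, 32)
mask = torch.zeros(32, 32); mask[8:16, 8:16] = 1
entity = entity_representation(feats, mask)                       # what to move
trajectory = torch.tensor([[12., 12.], [14., 11.], [16., 10.]])   # where to move it
condition = (entity, trajectory)  # control signal for the generator
```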
no code implementations • 15 Jan 2024 • Jay Zhangjie Wu, Guian Fang, HaoNing Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, YuChao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
1 code implementation • 3 Jan 2024 • David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text.
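A sketch of conditioning on both modalities at once: a video latent attends separately to text embeddings and image embeddings, and the two outputs are fused residually. Dimensions and the fusion-by-sum choice are illustrative assumptions, not Moonshot's actual block design.

```python
# Hedged sketch: two cross-attention passes, one per condition modality.
import torch
import torch.nn as nn

DIM = 512  # assumed hidden size

class MultimodalCrossAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(DIM, 8, batch_first=True)
        self.attn_image = nn.MultiheadAttention(DIM, 8, batch_first=True)

    def forward(self, latent, text_emb, image_emb):
        out_t, _ = self.attn_text(latent, text_emb, text_emb)
        out_i, _ = self.attn_image(latent, image_emb, image_emb)
        return latent + out_t + out_i  # residual fusion of both conditions

block = MultimodalCrossAttention()
latent = torch.randn(1, 64, DIM)   # video latent tokens
text = torch.randn(1, 77, DIM)     # text encoder output
image = torch.randn(1, 16, DIM)    # image encoder output
fused = block(latent, text, image)
```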
no code implementations • CVPR 2024 • YuChao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang
In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.
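To make the sparse-correspondence idea concrete, here is a toy sketch that propagates the displacements of a few semantic points to a dense warp field via inverse-distance weighting. The weighting scheme is an assumption for illustration; it is not VideoSwap's actual propagation mechanism.

```python
# Hedged sketch: a handful of point correspondences -> dense warp field.
import numpy as np

def dense_warp_from_points(src_pts, dst_pts, h, w, eps=1e-6):
    """src_pts, dst_pts: (N, 2) arrays of matching (x, y) semantic points."""
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)
    disp = dst_pts - src_pts                                  # per-point displacement
    d = np.linalg.norm(grid[:, None] - src_pts[None], axis=-1) + eps
    weights = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)
    flow = weights @ disp                                     # (h*w, 2) dense field
    return flow.reshape(h, w, 2)

src = np.array([[10., 10.], [20., 15.], [15., 25.]])  # source semantic points
dst = src + np.array([2., 0.])                        # user-edited positions
flow = dense_warp_from_points(src, dst, 32, 32)
```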
1 code implementation • 12 Oct 2023 • Rui Zhao, YuChao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion.
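A common recipe for this kind of customization is to freeze the pretrained text-to-video model and fine-tune only the layers that model time, so the clips' shared motion is absorbed without overwriting appearance. The name-based filter below is a hypothetical convention, not the paper's exact adapter scheme.

```python
# Hedged sketch: leave only temporal layers trainable, then fine-tune
# on the clip set with the usual diffusion loss.
import torch.nn as nn

def mark_trainable_temporal(model: nn.Module):
    for name, param in model.named_parameters():
        # assumption: temporal modules carry "temporal" in their names
        param.requires_grad = "temporal" in name

# usage: after loading a pretrained video diffusion model `vdm`,
# mark_trainable_temporal(vdm) freezes everything except temporal layers.
```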
1 code implementation • 27 Sep 2023 • David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, YuChao Gu, Difei Gao, Mike Zheng Shou
In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation; a pipeline sketch follows below.
Ranked #2 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
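The hybrid design can be read as a two-stage pipeline: a pixel-space VDM produces a low-resolution video with strong text alignment, and a latent-space VDM then upsamples it cheaply. The function names and resolutions below are hypothetical placeholders for the two stages.

```python
# Hedged sketch of a pixel-then-latent hybrid text-to-video pipeline.
def generate_hybrid(prompt, pixel_vdm, latent_vdm):
    # pixel stage: cheap at low resolution, good text-video alignment
    low_res = pixel_vdm.sample(prompt, size=(64, 40))
    # latent stage: efficient high-resolution detail on top of the draft
    video = latent_vdm.super_resolve(low_res, prompt, size=(576, 320))
    return video
```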
no code implementations • 14 Sep 2023 • David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou
Dataset condensation aims to condense a large dataset with many training samples into a small synthetic set.
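For context, the classic gradient-matching formulation of dataset condensation learns a tiny synthetic set whose training gradients mimic those of the real data. This is a standard objective in the area, not necessarily this paper's method; the sketch assumes `syn_x` is a leaf tensor with `requires_grad=True` registered in `opt_syn`.

```python
# Hedged sketch of one gradient-matching condensation step.
import torch
import torch.nn.functional as F

def condensation_step(model, real_x, real_y, syn_x, syn_y, opt_syn):
    # gradients on real data (treated as the target, hence detached)
    loss_real = F.cross_entropy(model(real_x), real_y)
    g_real = torch.autograd.grad(loss_real, model.parameters())
    # gradients on synthetic data, kept differentiable w.r.t. syn_x
    loss_syn = F.cross_entropy(model(syn_x), syn_y)
    g_syn = torch.autograd.grad(loss_syn, model.parameters(), create_graph=True)
    match = sum(((a - b.detach()) ** 2).sum() for a, b in zip(g_syn, g_real))
    opt_syn.zero_grad(); match.backward(); opt_syn.step()  # update synthetic set
    return match.item()
```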
no code implementations • 13 Aug 2023 • David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to data-privacy concerns.
2 code implementations • ICCV 2023 • Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou
Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%).
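The quoted ratios check out arithmetically, and a generic top-k selection loop illustrates the shape of such compression. The scoring function here is an assumed placeholder (e.g., a CLIP-style image-text alignment score), not TL;DR's learned selector or its caption refinement.

```python
# Worked arithmetic for the quoted compression ratios.
print(0.67 / 2.82, 2.5 / 15)  # ~0.238 and ~0.167, i.e., ~24% and ~16.7%

# Hedged sketch: rank image-text pairs by a score and keep the top fraction.
def select_subset(pairs, score_fn, keep_ratio):
    ranked = sorted(pairs, key=score_fn, reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]
```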
1 code implementation • CVPR 2023 • Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou
In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks.
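One attention-pooling reading of the semantic-token idea: compress many patch tokens into a handful of learnable-query tokens so later blocks run on far fewer tokens. Hyperparameters and the single-attention design below are illustrative assumptions, not STViT's exact recovery-capable module.

```python
# Hedged sketch: learnable queries + cross-attention as token reduction.
import torch
import torch.nn as nn

class SemanticTokenPool(nn.Module):
    def __init__(self, dim=384, num_semantic=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_semantic, dim))
        self.attn = nn.MultiheadAttention(dim, 6, batch_first=True)

    def forward(self, patch_tokens):  # (B, N, dim), N can be large
        q = self.queries.expand(patch_tokens.size(0), -1, -1)
        semantic, _ = self.attn(q, patch_tokens, patch_tokens)
        return semantic  # (B, 16, dim): far fewer tokens for later blocks

pool = SemanticTokenPool()
out = pool(torch.randn(2, 196, 384))  # 196 patch tokens -> 16 semantic tokens
```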
1 code implementation • ICCV 2023 • Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, Mike Zheng Shou
Remarkably, with only 25% annotated video frames, our method still outperforms the base CL learners, which are trained with 100% annotations on all video frames.
1 code implementation • 31 May 2022 • Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, XiaoHu Qie, Mike Zheng Shou
In this paper, we present DeVRF, a novel representation to accelerate learning dynamic radiance fields.
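A minimal sketch of a deformable voxel radiance field, assuming the common static-canonical-volume-plus-deformation decomposition: a time-conditioned network maps each query point back into a canonical voxel grid, which is sampled trilinearly. Grid sizes and the MLP deformation network are illustrative, not DeVRF's released configuration.

```python
# Hedged sketch: canonical voxel grid + time-conditioned deformation.
import torch
import torch.nn.functional as F

canonical = torch.randn(1, 4, 64, 64, 64)  # RGB + density voxel grid (assumed)

def query(points, t, deform_net):
    """points: (N, 3) in [-1, 1]; t: scalar time; deform_net: (x, y, z, t) -> offsets."""
    t_col = torch.full((points.size(0), 1), float(t))
    offsets = deform_net(torch.cat([points, t_col], dim=1))     # (N, 3)
    canon_pts = (points + offsets).view(1, -1, 1, 1, 3)         # warp to canonical
    feats = F.grid_sample(canonical, canon_pts, align_corners=True)
    return feats.view(4, -1).t()  # (N, 4) color + density features

deform_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(),
                                 torch.nn.Linear(64, 3))
rgb_sigma = query(torch.rand(128, 3) * 2 - 1, t=0.5, deform_net=deform_net)
```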
no code implementations • CVPR 2022 • Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, Yu Qiao
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition.
2 code implementations • 24 Nov 2021 • David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou
With such multi-dimension and multi-scale factorization, our MorphMLP block achieves a strong accuracy-computation trade-off; a minimal sketch follows below.
Ranked #39 on Action Recognition on Something-Something V2 (using extra training data)
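To illustrate the factorized-mixing idea, the block below applies separate MLPs along the temporal, spatial, and channel axes of a video tensor. It captures the multi-dimension spirit only; the chunk-wise "morphing" receptive-field scheme is omitted and all sizes are illustrative.

```python
# Hedged sketch: per-axis MLP mixing over a (B, T, N, C) video tensor.
import torch
import torch.nn as nn

class FactorizedMLPBlock(nn.Module):
    def __init__(self, dim=64, frames=8, tokens=49):
        super().__init__()
        self.temporal = nn.Linear(frames, frames)  # mixes along time
        self.spatial = nn.Linear(tokens, tokens)   # mixes along space
        self.channel = nn.Linear(dim, dim)         # mixes along channels

    def forward(self, x):  # x: (batch, frames, tokens, channels)
        x = x + self.temporal(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x + self.spatial(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        return x + self.channel(x)

block = FactorizedMLPBlock()
y = block(torch.randn(2, 8, 49, 64))  # shape preserved: (2, 8, 49, 64)
```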