1 code implementation • ACL 2022 • Yan Liu, Sanyuan Chen, Yazheng Yang, Qi Dai
In this paper, we propose a multi-level Mutual Promotion mechanism for self-evolved Inference and sentence-level Interpretation (MPII).
no code implementations • 21 May 2025 • Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, Dongdong Chen, Zuxuan Wu, Chong Luo
Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals.
no code implementations • 1 May 2025 • Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth.
no code implementations • 23 Apr 2025 • Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner without additional tuning.
no code implementations • 16 Apr 2025 • Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges.
no code implementations • 20 Mar 2025 • Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality.
no code implementations • 18 Mar 2025 • Yifei Dong, Fengyi Wu, Qi He, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Zhi-Qi Cheng, Alexander G Hauptmann
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments.
no code implementations • 14 Mar 2025 • Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu
We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of high-compressed semantic tokens in text-to-video tasks.
no code implementations • 28 Feb 2025 • Qi Dai, Beixiong Zheng, Qiyao Wang, Xue Xiong, Xiaodan Shao, Lipeng Zhu, Rui Zhang
Rotatable antenna (RA) represents a novel antenna architecture that enhances wireless communication system performance by independently or collectively adjusting each antenna's boresight/orientation.
no code implementations • CVPR 2025 • Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis.
no code implementations • CVPR 2025 • Ding Ding, Yueming Pan, Ruoyu Feng, Qi Dai, Kai Qiu, Jianmin Bao, Chong Luo, Zhenzhong Chen
In this paper, we present HomoGen, an enhanced video inpainting method based on homography propagation and diffusion models.
1 code implementation • 14 Dec 2024 • Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua
Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes.
1 code implementation • 5 Dec 2024 • Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, Chong Luo, Xin Geng, Baining Guo
Such a vision-in-the-chain reasoning paradigm is better aligned with the needs of multimodal agents, yet it is rarely evaluated.
1 code implementation • CVPR 2025 • Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu
During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.
1 code implementation • 20 Nov 2024 • Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Commercial video generation models have exhibited realistic, high-fidelity results, but access to them remains limited.
1 code implementation • 7 Nov 2024 • Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Chunyu Wang, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs.
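For context on this line, here is a minimal sketch of the symmetric contrastive objective that CLIP-style training uses: cross-entropy over cosine similarities of paired image/text embeddings. This is a generic InfoNCE formulation with illustrative dimensions, not this paper's actual training code.

```python
# Generic CLIP-style contrastive loss sketch (illustrative, not this
# paper's code): matched image/text pairs sit on the diagonal of the
# similarity matrix and are pulled together from both directions.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(img))            # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```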
1 code implementation • 27 Jun 2024 • Heng Li, Minghan Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions.
no code implementations • 13 Jun 2024 • Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, Baining Guo
In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system.
no code implementations • 10 Jun 2024 • Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation.
1 code implementation • 30 May 2024 • Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing.
no code implementations • 8 Apr 2024 • Qi Dai, Ryan Davis, Houlin Hong, Ying Gu
Class II preparation at 400 µm tolerance had the smallest mean difference of 0.41 points.
no code implementations • 23 Mar 2024 • Kesheng Wang, Kunhui Xu, Xiaoyu Chen, Chunlei He, Jianfeng Zhang, Dexing Kong, Qi Dai, Shoujun Huang
For improved segmentation of the pupil and tear meniscus areas, the convolutional neural network Inception-v3 was first implemented as an image quality assessment model, effectively identifying higher-quality images with an accuracy of 98.224%.
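A hedged sketch of the quality-gating step this line describes, using torchvision's Inception-v3 as a binary classifier. The class labels, input, and weights choice below are assumptions for illustration, not the paper's pipeline.

```python
# Sketch: Inception-v3 repurposed as a two-class image-quality gate.
# Labels ("usable" vs. "reject") and the toy input are illustrative.
import torch
import torch.nn as nn
from torchvision import models

model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: usable vs. reject
model.eval()                                    # eval mode: no auxiliary head output

x = torch.randn(1, 3, 299, 299)                 # Inception-v3 expects 299x299 input
with torch.no_grad():
    probs = model(x).softmax(dim=-1)
print(probs)  # [p(usable), p(reject)] for the toy input (fc is untrained here)
```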
1 code implementation • CVPR 2024 • Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, Xian-Sheng Hua
To remedy this, we propose a two-fold strategy: (1) we introduce an innovative approach that encodes bone connectivity by harnessing graph distances to describe the physical topology (a toy graph-distance sketch follows this entry); and (2) we further incorporate action-specific topological representations via persistent homology analysis to depict systemic dynamics.
Ranked #6 on Skeleton Based Action Recognition on NTU RGB+D 120
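The graph-distance encoding mentioned above can be illustrated on a toy skeleton. The joint count and bone list below are hypothetical (not the NTU RGB+D skeleton), and the paper's actual topology encoding, including its persistent-homology component, is richer than this.

```python
# Toy sketch: encode bone connectivity as all-pairs hop distances over a
# hypothetical 5-joint skeleton; such a static topological descriptor can
# be embedded and added to joint features.
import numpy as np
from scipy.sparse.csgraph import shortest_path

num_joints = 5
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]   # hypothetical bone list

adj = np.zeros((num_joints, num_joints))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# Hop distance between every pair of joints (Dijkstra, unweighted edges).
hop_dist = shortest_path(adj, method="D", unweighted=True)
print(hop_dist)
```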
1 code implementation • CVPR 2024 • Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance.
no code implementations • 30 Nov 2023 • Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang
Our model can edit and translate the desired results within seconds based on user instructions.
no code implementations • 30 Nov 2023 • Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong
Second, it preserves the high-fidelity generation ability of the pre-trained image diffusion models by making only minimal network modifications.
1 code implementation • 16 Oct 2023 • Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews on their application in the video domain.
no code implementations • CVPR 2024 • Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of the 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
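The parameter-efficient pattern this line relies on can be sketched generically. The bottleneck adapter below is a standard illustration with assumed sizes, not the authors' SimDA modules: freeze the large pre-trained weights and train only small inserted modules.

```python
# Generic bottleneck-adapter sketch (not SimDA itself): the frozen
# backbone keeps its pre-trained weights; only the adapter trains.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity (residual branch = 0)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# A stand-in pre-trained block, frozen; only the adapter is trainable.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(768)

x = torch.randn(2, 16, 768)
y = adapter(backbone(x))
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # ~100K vs. the frozen backbone
```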
1 code implementation • ICCV 2023 • Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang
While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already captures the essential information without temporal attention.
Ranked #21 on Action Classification on Kinetics-400
1 code implementation • ICCV 2023 • Zhi-Qi Cheng, Qi Dai, SiYao Li, Jingdong Sun, Teruko Mitamura, Alexander G. Hauptmann
We evaluate ChartReader on Chart-to-Table, ChartQA, and Chart-to-Text tasks, demonstrating its superiority over existing methods.
no code implementations • 21 Feb 2023 • Yan Liu, Xiaokang Chen, Qi Dai
However, current works pursuing sentence-level explanations rely heavily on annotated training data, which limits the development of interpretability to only a few tasks.
1 code implementation • ICCV 2023 • Jia Ning, Chen Li, Zheng Zhang, Zigang Geng, Qi Dai, Kun He, Han Hu
With these new techniques and other designs, we show that the proposed general-purpose task-solver can perform both instance segmentation and depth estimation well.
Ranked #27 on Monocular Depth Estimation on NYU-Depth V2
1 code implementation • CVPR 2023 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang
We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of mostly unseen testing resolutions.
1 code implementation • CVPR 2023 • Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
In this paper, we investigate the use of transformer models under the SSL setting for action recognition.
1 code implementation • 18 Aug 2022 • Zhi-Qi Cheng, Qi Dai, SiYao Li, Teruko Mitamura, Alexander G. Hauptmann
In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles.
1 code implementation • CVPR 2022 • Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, Alexander G. Hauptmann
We evaluate our methods on 4 mainstream object counting networks (i.e., MCNN, CSRNet, SANet, and ResNet-50).
Ranked #1 on Object Counting on TRANCOS
1 code implementation • CVPR 2023 • Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu
Our study reveals that (i) masked image modeling is also demanding on larger data.
1 code implementation • 30 May 2022 • Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian
A key idea of the efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), even though hierarchical vision transformers (e.g., Swin Transformer) potentially have better properties for formulating vision inputs.
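A small sketch of the efficiency trick this sentence describes: with a plain ViT, masked tokens can simply be gathered out before the encoder so compute scales with the visible tokens only. The masking ratio and shapes below are illustrative and mirror the common MAE-style random masking, not necessarily this paper's exact code.

```python
# Sketch: keep only visible patch tokens before the transformer encoder.
import torch

B, N, D = 2, 196, 768          # batch, patch tokens, embed dim (illustrative)
mask_ratio = 0.75
tokens = torch.randn(B, N, D)  # patch embeddings (pre-encoder)

num_keep = int(N * (1 - mask_ratio))
# Random per-sample shuffle; the first num_keep indices stay visible.
noise = torch.rand(B, N)
ids_shuffle = noise.argsort(dim=1)
ids_keep = ids_shuffle[:, :num_keep]

visible = torch.gather(tokens, 1,
                       ids_keep.unsqueeze(-1).expand(-1, -1, D))
print(visible.shape)  # torch.Size([2, 49, 768]): only 25% of the tokens
                      # pass through the encoder
```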
no code implementations • 26 Apr 2022 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang
With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, the recent literature has proposed various variants of the vanilla ViT to achieve better efficiency and efficacy.
no code implementations • 11 Jan 2022 • Qi Dai, Jian-wei Liu, Yang Liu
The Tomek-Link sampling algorithm can effectively reduce class overlap in the data, remove majority-class instances that are difficult to distinguish, and improve classification accuracy.
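A minimal sketch of Tomek-Link undersampling, using the imbalanced-learn library as a stand-in implementation (an assumption; the paper's own code may differ). A Tomek link is a pair of nearest-neighbor samples from opposite classes; removing the majority-class member of each pair cleans the overlap region near the decision boundary.

```python
# Tomek-Link undersampling on a toy imbalanced dataset (illustrative).
from collections import Counter

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
print("before:", Counter(y))

# Remove the majority-class member of every Tomek link.
X_res, y_res = TomekLinks().fit_resample(X, y)
print("after: ", Counter(y_res))
```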
7 code implementations • CVPR 2022 • Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu
We also leverage this approach to facilitate the training of a 3B model (SwinV2-G) which, using $40\times$ less data than previous practice, achieves state-of-the-art results on four representative vision benchmarks.
no code implementations • 13 Jun 2021 • Shaobo Min, Qi Dai, Hongtao Xie, Chuang Gan, Yongdong Zhang, Jingdong Wang
Cross-modal correlation provides an inherent supervision for video unsupervised representation learning.
1 code implementation • ICLR 2022 • Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang
Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window.
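The sparse-connectivity pattern described here is exactly what a depth-wise convolution realizes, which a short PyTorch sketch makes concrete (channel count and spatial sizes are illustrative).

```python
# Depth-wise convolution exhibits this sparse connectivity: groups=channels
# means no mixing across channels, and a 3x3 kernel limits each output
# position to a small local window.
import torch
import torch.nn as nn

channels = 64
depthwise = nn.Conv2d(
    channels, channels,
    kernel_size=3, padding=1,
    groups=channels,  # one filter per channel: no cross-channel connections
    bias=False,
)

x = torch.randn(1, channels, 56, 56)
y = depthwise(x)
print(y.shape)  # torch.Size([1, 64, 56, 56]); each output value depends
                # only on a 3x3 window of its own input channel
```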
6 code implementations • 10 May 2021 • Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, Han Hu
We are witnessing a modeling shift from CNN to Transformers in computer vision.
Ranked #78 on Self-Supervised Image Classification on ImageNet
no code implementations • 5 May 2021 • Qi Dai, Di Shen, Jinhong Wang, Suzhou Huang, Dimitar Filev
Towards this end, it is necessary to have a comprehensive modeling framework for decision-making within which human driving preferences can be inferred statistically from driving behaviors observed in realistic, naturalistic traffic settings.
no code implementations • 10 Mar 2021 • Jisheng Li, Qi Dai, Jiangtao Wen
Consistent in-focus input imagery is an essential precondition for machine vision systems to perceive the dynamic environment.
no code implementations • ICCV 2021 • Baifeng Shi, Qi Dai, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu
We extensively benchmark against the baselines for SSAD and OSAD on our data splits created from THUMOS14 and ActivityNet-1.2, and demonstrate the effectiveness of the proposed UFA and IB methods.
no code implementations • 25 Sep 2020 • Qi Dai, Xunnong Xu, Wen Guo, Suzhou Huang, Dimitar Filev
To demonstrate how our approach can be applied to realistic traffic settings, we conduct a simulation experiment: deriving merging and yielding behaviors on a double-lane highway with an unexpected barrier.
1 code implementation • ICML 2020 • Baifeng Shi, Dinghuai Zhang, Qi Dai, Zhanxing Zhu, Yadong Mu, Jingdong Wang
Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture.
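A simplified sketch of such a Dropout-like decorrelation step. Note that local variance is used below as a crude stand-in for the paper's local self-information criterion, so this illustrates the idea rather than the actual InfoDrop algorithm; window size and drop rate are assumptions.

```python
# Simplified texture-decorrelation sketch: drop feature activations more
# aggressively where the input image is locally low-information (flat or
# repetitive texture), using local variance as a crude information proxy.
import torch
import torch.nn.functional as F

def info_dropout(feat: torch.Tensor, img: torch.Tensor,
                 window: int = 3, max_drop: float = 0.5) -> torch.Tensor:
    gray = img.mean(dim=1, keepdim=True)                              # (B,1,H,W)
    mu = F.avg_pool2d(gray, window, stride=1, padding=window // 2)
    var = F.avg_pool2d(gray ** 2, window, stride=1, padding=window // 2) - mu ** 2
    info = var / (var.amax(dim=(2, 3), keepdim=True) + 1e-6)          # ~[0, 1]
    info = F.interpolate(info, size=feat.shape[-2:], mode="bilinear",
                         align_corners=False)
    drop_prob = max_drop * (1.0 - info)          # low information -> high drop
    mask = torch.bernoulli(1.0 - drop_prob)
    return feat * mask / (1.0 - drop_prob + 1e-6)  # inverted-dropout scaling

feat = torch.randn(2, 64, 32, 32)
img = torch.rand(2, 3, 128, 128)
out = info_dropout(feat, img)
print(out.shape)
```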
no code implementations • 24 Apr 2020 • Xingbo Liu, Xiushan Nie, Qi Dai, Yupan Huang, Yilong Yin
Due to its compelling efficiency in retrieval and storage, similarity-preserving hashing has been widely applied to approximate nearest neighbor search in large-scale image retrieval.
1 code implementation • CVPR 2020 • Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
Ranked #11 on Weakly Supervised Action Localization on ActivityNet-1.2
no code implementations • 9 Dec 2019 • Qi Dai, Vaishakh Patil, Simon Hecker, Dengxin Dai, Luc van Gool, Konrad Schindler
We present a self-supervised learning framework to estimate the individual object motion and monocular depth from video.
no code implementations • 17 Sep 2019 • Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, Alexander Hauptmann
By minimizing the mutual information, each column is guided to learn features with different image scales.
no code implementations • ICCV 2019 • Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Alexander Hauptmann
Although the Maximum Excess over SubArrays (MESA) loss has previously been proposed to address the above issues by finding the rectangular subregion whose predicted density map has the maximum difference from the ground truth, it cannot be solved by gradient descent and thus can hardly be integrated into a deep learning framework (a brute-force MESA sketch follows this entry).
Ranked #5 on Crowd Counting on WorldExpo’10
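For reference, the MESA quantity mentioned above (the maximum absolute prediction-minus-ground-truth mass over all rectangles) can be computed by brute force on a toy density map. The sketch below uses 2D prefix sums and is illustrative only; the entry's point is precisely that this hard max over rectangles resists gradient-based training.

```python
# Brute-force MESA on a toy density map: the maximum absolute difference
# between predicted and ground-truth density over all rectangles.
import numpy as np

def mesa(pred: np.ndarray, gt: np.ndarray) -> float:
    diff = pred - gt
    # 2D prefix sums with a zero border give O(1) rectangle sums.
    s = np.zeros((diff.shape[0] + 1, diff.shape[1] + 1))
    s[1:, 1:] = diff.cumsum(0).cumsum(1)
    best = 0.0
    h, w = diff.shape
    for r1 in range(h):
        for r2 in range(r1 + 1, h + 1):
            for c1 in range(w):
                for c2 in range(c1 + 1, w + 1):
                    rect = s[r2, c2] - s[r1, c2] - s[r2, c1] + s[r1, c1]
                    best = max(best, abs(rect))
    return best

rng = np.random.default_rng(0)
pred, gt = rng.random((8, 8)), rng.random((8, 8))
print(mesa(pred, gt))  # the hard max over rectangles is what makes this
                       # loss awkward for gradient descent
```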
1 code implementation • 16 Apr 2019 • Yupan Huang, Qi Dai, Yutong Lu
Each branch produces a set of action anchor layers by applying deconvolution to the feature maps of the main stream (sketched below).
Ranked #28 on Temporal Action Localization on THUMOS’14
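The deconvolution step this entry mentions can be sketched in a few lines of PyTorch; the channel sizes, stride, and temporal length below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: produce a finer temporal "anchor layer" by applying a 1-D
# transposed convolution (deconvolution) to a coarse feature map.
import torch
import torch.nn as nn

coarse = torch.randn(1, 256, 16)     # (batch, channels, temporal length)
deconv = nn.ConvTranspose1d(256, 256, kernel_size=4, stride=2, padding=1)

anchor_layer = deconv(coarse)        # temporal length 16 -> 32
print(anchor_layer.shape)            # torch.Size([1, 256, 32])
```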
no code implementations • ECCV 2018 • Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, Tao Mei
The RTP initializes action proposals of the start frame through a Region Proposal Network and then estimates the movements of proposals in next frame in a recurrent manner.