Search Results for author: Qi Dai

Found 56 papers, 26 papers with code

MPII: Multi-Level Mutual Promotion for Inference and Interpretation

1 code implementation ACL 2022 Yan Liu, Sanyuan Chen, Yazheng Yang, Qi Dai

In this paper, we propose a multi-level Mutual Promotion mechanism for self-evolved Inference and sentence-level Interpretation (MPII).

Sentence

Subject-driven Video Generation via Disentangled Identity and Motion

no code implementations 23 Apr 2025 Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo

We propose to train a subject-driven customized video generation model through decoupling the subject-specific learning from temporal dynamics in zero-shot without additional tuning.

Subject-driven Video Generation Video Generation

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

no code implementations 16 Apr 2025 Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng

Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges.

Benchmarking Language Modeling +2

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

no code implementations 20 Mar 2025 Quanhao Li, Zhen Xing, Rui Wang, HUI ZHANG, Qi Dai, Zuxuan Wu

However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality.

Image to Video Generation Object

HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard

no code implementations 18 Mar 2025 Yifei Dong, Fengyi Wu, Qi He, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Zhi-Qi Cheng, Alexander G Hauptmann

Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments.

Benchmarking Human Dynamics +1

HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

no code implementations 14 Mar 2025 Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu

We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of high-compressed semantic tokens in text-to-video tasks.

Text-to-Video Generation Video Generation

A Demo of Radar Sensing Aided Rotatable Antenna for Wireless Communication System

no code implementations 28 Feb 2025 Qi Dai, Beixiong Zheng, Qiyao Wang, Xue Xiong, Xiaodan Shao, Lipeng Zhu, Rui Zhang

Rotatable antenna (RA) represents a novel antenna architecture that enhances wireless communication system performance by independently or collectively adjusting each antenna's boresight/orientation.

FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

no code implementations CVPR 2025 Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho

To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis.

Motion Synthesis Optical Flow Estimation +1

UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

1 code implementation 14 Dec 2024 Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua

Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes.

Retrieval

MageBench: Bridging Large Multimodal Models to Agents

1 code implementation 5 Dec 2024 Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, Chong Luo, Xin Geng, Baining Guo

Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, yet it is rarely evaluated.

Sokoban

StableAnimator: High-Quality Identity-Preserving Human Image Animation

1 code implementation CVPR 2025 Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu

During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.

Denoising Face Reenactment +3

REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

1 code implementation 20 Nov 2024 Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang

Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access.

Video Generation

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

1 code implementation 7 Nov 2024 Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Chunyu Wang, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu

CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs.

Contrastive Learning Image Captioning +6

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

no code implementations 13 Jun 2024 Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, Baining Guo

In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system.

Retrieval

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

no code implementations 10 Jun 2024 Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang

Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation.

Language Modelling Large Language Model +1

Effectiveness of Self-Assessment Software to Evaluate Preclinical Operative Procedures

no code implementations 8 Apr 2024 Qi Dai, Ryan Davis, Houlin Hong, Ying Gu

Class II preparation at 400 $\mu$m tolerance had the smallest mean difference of 0.41 points.

An edge detection-based deep learning approach for tear meniscus height measurement

no code implementations 23 Mar 2024 Kesheng Wang, Kunhui Xu, Xiaoyu Chen, Chunlei He, Jianfeng Zhang, Dexing Kong, Qi Dai, Shoujun Huang

For improved segmentation of the pupil and tear meniscus areas, the convolutional neural network Inceptionv3 was first implemented as an image quality assessment model, effectively identifying higher-quality images with an accuracy of 98.224%.

Edge Detection Image Quality Assessment

BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition

1 code implementation CVPR 2024 Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, Xian-Sheng Hua

To remedy this we propose a two-fold strategy: (1) We introduce an innovative approach that encodes bone connectivity by harnessing the power of graph distances to describe the physical topology; we further incorporate action-specific topological representation via persistent homology analysis to depict systemic dynamics.

Action Recognition Skeleton Based Action Recognition

MotionEditor: Editing Video Motion via Content-Aware Diffusion

1 code implementation CVPR 2024 Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance.

Video Editing

A Survey on Video Diffusion Models

1 code implementation 16 Oct 2023 Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews on their application in the video domain.

Image Generation Survey +3

SimDA: Simple Diffusion Adapter for Efficient Video Generation

no code implementations CVPR 2024 Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang

In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.

Transfer Learning Video Editing +2

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

1 code implementation ICCV 2023 Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang

While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already provides enough essence without temporal attention.

Action Classification Action Recognition +1

Parallel Sentence-Level Explanation Generation for Real-World Low-Resource Scenarios

no code implementations 21 Feb 2023 Yan Liu, Xiaokang Chen, Qi Dai

However, current works pursuing sentence-level explanations rely heavily on annotated training data, which limits the development of interpretability to only a few tasks.

Explanation Generation Natural Language Inference +2

All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

1 code implementation ICCV 2023 Jia Ning, Chen Li, Zheng Zhang, Zigang Geng, Qi Dai, Kun He, Han Hu

With these new techniques and other designs, we show that the proposed general-purpose task-solver can perform both instance segmentation and depth estimation well.

All Instance Segmentation +2

ResFormer: Scaling ViTs with Multi-Resolution Training

1 code implementation CVPR 2023 Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang

We introduce ResFormer, a framework that is built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions.

Action Recognition image-classification +5

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

1 code implementation 30 May 2022 Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian

A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), although hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs.

Transfer Learning

Deeper Insights into the Robustness of ViTs towards Common Corruptions

no code implementations 26 Apr 2022 Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang

With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, recent literature has proposed various variants of vanilla ViTs to achieve better efficiency and efficacy.

Benchmarking Data Augmentation

Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data

no code implementations 11 Jan 2022 Qi Dai, Jian-wei Liu, Yang Liu

The Tomek-Link sampling algorithm can effectively reduce the class overlap on data, remove the majority instances that are difficult to distinguish, and improve the algorithm classification accuracy.

Classification imbalanced classification

SimMIM: A Simple Framework for Masked Image Modeling

7 code implementations CVPR 2022 Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu

We also leverage this approach to facilitate the training of a 3B model (SwinV2-G), achieving the state of the art on four representative vision benchmarks with $40\times$ less data than in previous practice.

Representation Learning Self-Supervised Image Classification +1

On the Connection between Local Attention and Dynamic Depth-wise Convolution

1 code implementation ICLR 2022 Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang

Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window.

object-detection Object Detection +2

Calibration of Human Driving Behavior and Preference Using Naturalistic Traffic Data

no code implementations 5 May 2021 Qi Dai, Di Shen, Jinhong Wang, Suzhou Huang, Dimitar Filev

Towards this end it is necessary that we have a comprehensive modeling framework for decision-making within which human driving preferences can be inferred statistically from observed driving behaviors in realistic and naturalistic traffic settings.

Autonomous Vehicles Decision Making

Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

no code implementations 10 Mar 2021 Jisheng Li, Qi Dai, Jiangtao Wen

Consistent in-focus input imagery is an essential precondition for machine vision systems to perceive the dynamic environment.

Temporal Action Detection with Multi-level Supervision

no code implementations ICCV 2021 Baifeng Shi, Qi Dai, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu

We extensively benchmark against the baselines for SSAD and OSAD on our created data splits in THUMOS14 and ActivityNet1.2, and demonstrate the effectiveness of the proposed UFA and IB methods.

Action Detection Semi-Supervised Action Detection

Towards a Systematic Computational Framework for Modeling Multi-Agent Decision-Making at Micro Level for Smart Vehicles in a Smart World

no code implementations 25 Sep 2020 Qi Dai, Xunnong Xu, Wen Guo, Suzhou Huang, Dimitar Filev

To demonstrate how our approach can be applied to realistic traffic settings, we conduct a simulation experiment: to derive merging and yielding behaviors on a double-lane highway with an unexpected barrier.

Autonomous Vehicles Computational Efficiency +1

Informative Dropout for Robust Representation Learning: A Shape-bias Perspective

1 code implementation ICML 2020 Baifeng Shi, Dinghuai Zhang, Qi Dai, Zhanxing Zhu, Yadong Mu, Jingdong Wang

Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture.

Domain Generalization Representation Learning

Reinforcing Short-Length Hashing

no code implementations 24 Apr 2020 Xingbo Liu, Xiushan Nie, Qi Dai, Yupan Huang, Yilong Yin

Due to the compelling efficiency in retrieval and storage, similarity-preserving hashing has been widely applied to approximate nearest neighbor search in large-scale image retrieval.

Image Retrieval Retrieval

Self-supervised Object Motion and Depth Estimation from Video

no code implementations 9 Dec 2019 Qi Dai, Vaishakh Patil, Simon Hecker, Dengxin Dai, Luc van Gool, Konrad Schindler

We present a self-supervised learning framework to estimate the individual object motion and monocular depth from video.

Depth Estimation Instance Segmentation +6

Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting

no code implementations 17 Sep 2019 Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, Alexander Hauptmann

By minimizing the mutual information, each column is guided to learn features with different image scales.

Crowd Counting

Learning Spatial Awareness to Improve Crowd Counting

no code implementations ICCV 2019 Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Alexander Hauptmann

Although the Maximum Excess over SubArrays (MESA) loss has been previously proposed to address the above issues by finding the rectangular subregion whose predicted density map has the maximum difference from the ground truth, it cannot be solved by gradient descent, thus can hardly be integrated into the deep learning framework.

Crowd Counting Weakly-supervised Learning

Decoupling Localization and Classification in Single Shot Temporal Action Detection

1 code implementation 16 Apr 2019 Yupan Huang, Qi Dai, Yutong Lu

Each branch produces a set of action anchor layers by applying deconvolution to the feature maps of the main stream.

Action Detection Classification +2

Recurrent Tubelet Proposal and Recognition Networks for Action Detection

no code implementations ECCV 2018 Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, Tao Mei

The RTP initializes action proposals of the start frame through a Region Proposal Network and then estimates the movements of proposals in the next frame in a recurrent manner.

Action Detection Region Proposal
