1 code implementation • 9 Apr 2025 • Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions.
1 code implementation • 9 Dec 2024 • Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang
Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs.
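For concreteness, a back-of-the-envelope comparison (standard complexity accounting, not figures from this paper; single attention head, feature dimension $d$, and $N = HW$ tokens assumed):

```latex
% Cost of attention over N = HW tokens with feature dimension d
\text{Softmax attention: } \mathcal{O}(N^2 d)
\qquad
\text{Linear attention: } \mathcal{O}(N d^2)
% Doubling the input resolution quadruples N, so the softmax cost grows
% by 16x while the linear-attention cost grows by only 4x.
```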
1 code implementation • CVPR 2025 • Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens.
Ranked #19 on Visual Question Answering on MM-Vet
1 code implementation • 11 Nov 2024 • Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang
At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens.
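A minimal sketch of this disentangled computation (not the authors' architecture; layer counts, dimensions, and module names are placeholders): visible tokens pass through the deep encoder on their own, while mask tokens only run through a light decoder that cross-attends to the fully encoded visible tokens, so they never inflate the encoder's sequence length.

```python
import torch
import torch.nn as nn

class DisentangledEncoderDecoder(nn.Module):
    """Sketch: encode visible tokens alone; decode mask tokens against them."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

    def forward(self, visible, mask_tokens):
        # visible: (B, N_vis, dim); mask_tokens: (B, N_mask, dim)
        context = self.encoder(visible)            # visible tokens never see mask tokens
        return self.decoder(mask_tokens, context)  # mask tokens read the encoded context

x_vis = torch.randn(2, 192, 256)
x_mask = torch.randn(2, 64, 256)
out = DisentangledEncoderDecoder()(x_vis, x_mask)  # (2, 64, 256)
```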
1 code implementation • 4 Nov 2024 • Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data.
no code implementations • 28 Oct 2024 • Yong Xien Chng, Xuchong Qiu, Yizeng Han, Yifan Pu, Jiewei Cao, Gao Huang
Recently, Mamba has emerged as a promising alternative, offering efficient long-range contextual modeling capabilities without the quadratic complexity associated with Transformer's attention mechanisms.
2 code implementations • 4 Oct 2024 • Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations.
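A minimal sketch of spatial token skipping in this spirit (hypothetical names; a simple top-$k$ scorer stands in for the paper's learned selection policy): only the selected tokens pass through the expensive block, and the remaining tokens are carried through unchanged.

```python
import torch
import torch.nn as nn

def dynamic_token_forward(x, block, scorer, keep_ratio=0.5):
    """x: (B, N, C). Route only the highest-scoring tokens through `block`."""
    B, N, C = x.shape
    k = max(1, int(N * keep_ratio))
    scores = scorer(x).squeeze(-1)                 # (B, N) informativeness scores
    idx = scores.topk(k, dim=1).indices            # tokens worth computing
    gather = idx.unsqueeze(-1).expand(-1, -1, C)
    selected = torch.gather(x, 1, gather)          # (B, k, C)
    out = x.clone()                                # skipped tokens pass through
    out.scatter_(1, gather, block(selected))       # write computed tokens back
    return out

x = torch.randn(2, 196, 384)
block = nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
scorer = nn.Linear(384, 1)
y = dynamic_token_forward(x, block, scorer)        # (2, 196, 384)
```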
no code implementations • 24 Sep 2024 • Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang
Building on the observation that VLMs pre-trained on global-pooled image-text features often fail to capture fine-grained semantics necessary for effective mask classification, we propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
1 code implementation • 22 Sep 2024 • Le Yang, Ziwei Zheng, Yizeng Han, Shiji Song, Gao Huang, Fan Li
Differentiable architecture search (DARTS) has emerged as a promising technique for effective neural architecture search. It mainly involves two steps to find a high-performance architecture: first, the DARTS supernet, which consists of mixed operations, is optimized via gradient descent.
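For reference, a minimal sketch of the mixed operation at the heart of that first step (the standard DARTS formulation; this particular candidate set is a simplified assumption):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Sketch of a DARTS mixed operation: a softmax-weighted sum of candidates."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.AvgPool2d(3, stride=1, padding=1),
        ])
        # architecture parameters, optimized jointly with the network weights
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# After supernet training, the second step discretizes: on each edge,
# only the operation with the largest alpha is kept.
y = MixedOp(16)(torch.randn(2, 16, 32, 32))
```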
1 code implementation • 11 Aug 2024 • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
1 code implementation • 29 Jul 2024 • Chaoqun Du, Yulin Wang, Jiayi Guo, Yizeng Han, Jie Zhou, Gao Huang
To this end, we propose a Unified Test-Time Adaptation (UniTTA) benchmark, which is comprehensive and widely applicable.
1 code implementation • 3 Jul 2024 • Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li
Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations.
Ranked #3 on Temporal Action Localization on HACS
1 code implementation • 26 May 2024 • Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success.
1 code implementation • 14 May 2024 • Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang
When examined in the frequency and spatial domains, these patterns incorporate lower-frequency components and natural image content free of distortion or data augmentation.
1 code implementation • 18 Mar 2024 • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You
Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success in adapting vision transformers (ViTs) by improving parameter efficiency.
no code implementations • 17 Mar 2024 • Jiangshan Wang, Yifan Pu, Yizeng Han, Jiayi Guo, Yiru Wang, Xiu Li, Gao Huang
GRA can adaptively capture fine-grained features of objects with diverse orientations, comprising two key components: Group-wise Rotating and Group-wise Attention.
1 code implementation • 21 Feb 2024 • Chaoqun Du, Yizeng Han, Gao Huang
Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched.
1 code implementation • CVPR 2024 • Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang
To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects.
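A minimal sketch of such an auxiliary objective, under my reading of the snippet (hypothetical names, not the authors' implementation): randomly replace some text-token features with a learned [MASK] embedding and train a small head to recover the original token ids from features that have already been fused with the visual stream, so correct predictions require fine-grained grounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_grounding_loss(text_ids, text_feats, mask_emb, predictor, p=0.15):
    """text_ids: (B, L) token ids; text_feats: (B, L, C) text features that
    are assumed to be cross-modally fused with the image already."""
    B, L, C = text_feats.shape
    mask = torch.rand(B, L, device=text_feats.device) < p
    feats = torch.where(mask.unsqueeze(-1), mask_emb.expand(B, L, C), text_feats)
    logits = predictor(feats)                       # (B, L, vocab_size)
    # a real implementation would guard the rare all-unmasked batch
    return F.cross_entropy(logits[mask], text_ids[mask])

vocab, C = 30522, 256
predictor = nn.Linear(C, vocab)
mask_emb = nn.Parameter(torch.zeros(C))
ids = torch.randint(0, vocab, (2, 20))
feats = torch.randn(2, 20, C)
loss = mask_grounding_loss(ids, feats, mask_emb, predictor)
```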
1 code implementation • CVPR 2024 • Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Ranked #1 on Generalized Referring Expression Segmentation on gRefCOCO (using extra training data)
2 code implementations • 14 Dec 2023 • Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, Gao Huang
Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module.
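Since the snippet names the quadruple, here is a minimal sketch of the two-stage computation as I read it (shapes simplified, single head, no projections): agent tokens first aggregate from $(K, V)$, then broadcast the pooled values back to the queries.

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent):
    """Sketch of agent attention. With n agent tokens and N sequence tokens,
    the cost is O(N n d) instead of the O(N^2 d) of full softmax attention."""
    d = q.shape[-1]
    # 1) agents act as queries and pool the whole sequence into n values
    agent_v = F.softmax(agent @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v
    # 2) agents act as keys; every query reads from the pooled agent values
    return F.softmax(q @ agent.transpose(-2, -1) / d ** 0.5, dim=-1) @ agent_v

B, N, n, d = 2, 196, 49, 64
q = k = v = torch.randn(B, N, d)
agent = torch.randn(B, n, d)           # in the paper, pooled from the queries
out = agent_attention(q, k, v, agent)  # (B, N, d)
```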
1 code implementation • 1 Sep 2023 • Yifan Pu, Yizeng Han, Yulin Wang, Junlan Feng, Chao Deng, Gao Huang
Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories.
1 code implementation • 30 Aug 2023 • Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang
Dynamic computation has emerged as a promising avenue to enhance the inference efficiency of deep networks.
no code implementations • 27 Aug 2023 • Yulin Wang, Yizeng Han, Chaofei Wang, Shiji Song, Qi Tian, Gao Huang
Over the past decade, deep learning models have exhibited considerable advancements, reaching or even exceeding human-level performance in a range of visual perception tasks.
1 code implementation • ICCV 2023 • Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks.
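A minimal sketch of the kernel trick that linear attention variants in this line of work build on: with a positive feature map $\phi$, the $\phi(K)^\top V$ product is computed once and reused by every query, so the cost scales as $\mathcal{O}(N d^2)$ rather than $\mathcal{O}(N^2 d)$. The elu+1 map below is a common stand-in, not necessarily this paper's particular mapping.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N d^2) attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V)."""
    phi = lambda x: F.elu(x) + 1           # a common positive feature map
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v           # (d, d): computed once for all queries
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps  # normalizer
    return (q @ kv) / z

q = k = v = torch.randn(2, 196, 64)
out = linear_attention(q, k, v)            # (2, 196, 64); no N x N matrix formed
```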
1 code implementation • ICCV 2023 • Yizeng Han, Dongchen Han, Zeyu Liu, Yulin Wang, Xuran Pan, Yifan Pu, Chao Deng, Junlan Feng, Shiji Song, Gao Huang
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
1 code implementation • ICCV 2023 • Yifan Pu, Yiru Wang, Zhuofan Xia, Yizeng Han, Yulin Wang, Weihao Gan, Zidong Wang, Shiji Song, Gao Huang
In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images, and an efficient conditional computation mechanism is introduced to accommodate the large orientation variations of objects within an image.
Ranked #3 on Oriented Object Detection on DOTA 1.0
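A minimal sketch of one way to rotate a convolution kernel adaptively (my illustration of the general idea, not the ARC module itself): resample the kernel's spatial grid under a rotation whose angle would, in the paper's setting, be predicted from the input image.

```python
import torch
import torch.nn.functional as F

def rotate_kernel(weight, theta):
    """Rotate a conv kernel of shape (out_c, in_c, k, k) by `theta` radians
    via bilinear resampling of its spatial grid."""
    out_c, in_c, k, _ = weight.shape
    cos, sin, zero = torch.cos(theta), torch.sin(theta), torch.zeros(())
    rot = torch.stack([torch.stack([cos, -sin, zero]),
                       torch.stack([sin,  cos, zero])]).unsqueeze(0)
    grid = F.affine_grid(rot, (1, in_c, k, k), align_corners=False)
    grid = grid.expand(out_c, -1, -1, -1)  # same angle applied to every filter
    return F.grid_sample(weight, grid, align_corners=False)

x = torch.randn(1, 16, 32, 32)
w = torch.randn(32, 16, 3, 3)
theta = torch.tensor(0.3)   # in ARC this would be predicted per image
y = F.conv2d(x, rotate_kernel(w, theta), padding=1)
```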
2 code implementations • 12 Oct 2022 • Yizeng Han, Zhihang Yuan, Yifan Pu, Chenhao Xue, Shiji Song, Guangyu Sun, Gao Huang
The latency prediction model can efficiently estimate the inference latency of dynamic networks by simultaneously considering algorithms, scheduling strategies, and hardware properties.
1 code implementation • 17 Sep 2022 • Yizeng Han, Yifan Pu, Zihang Lai, Chaofei Wang, Shiji Song, Junfen Cao, Wenhui Huang, Chao Deng, Gao Huang
Intuitively, easy samples, which generally exit early in the network during inference, should contribute more to training early classifiers.
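A minimal sketch of sample-weighted training for a multi-exit network in this spirit (the random weights below are a stand-in for the learned weighting the paper proposes): each sample's loss at each exit is scaled before averaging, so easy samples can be made to contribute more to the early classifiers.

```python
import torch
import torch.nn.functional as F

def weighted_multi_exit_loss(exit_logits, target, weights):
    """exit_logits: list of (B, num_classes), one per early exit.
    weights: (B, num_exits); uniform weights recover the plain mean loss."""
    losses = torch.stack([
        F.cross_entropy(logits, target, reduction='none')  # (B,)
        for logits in exit_logits
    ], dim=1)                                              # (B, num_exits)
    return (weights * losses).mean()

B, C, E = 8, 10, 3
logits = [torch.randn(B, C) for _ in range(E)]
target = torch.randint(0, C, (B,))
weights = torch.softmax(torch.randn(B, E), dim=1)  # stand-in for a learned weigher
loss = weighted_multi_exit_loss(logits, target, weights)
```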
no code implementations • ICCV 2021 • Chaofei Wang, Jiayu Xiao, Yizeng Han, Qisen Yang, Shiji Song, Gao Huang
The backbone of a traditional CNN classifier is generally regarded as a feature extractor, followed by a linear layer that performs the classification.
2 code implementations • ICCV 2021 • Yulin Wang, Zhaoxi Chen, Haojun Jiang, Shiji Song, Yizeng Han, Gao Huang
In this paper, we explore the spatial redundancy in video recognition with the aim to improve the computational efficiency.
no code implementations • 9 Feb 2021 • Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, Yulin Wang
Dynamic neural networks are an emerging research topic in deep learning.
2 code implementations • CVPR 2020 • Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang
Adaptive inference is an effective mechanism to achieve a dynamic tradeoff between accuracy and computational cost in deep networks.