1 code implementation • 22 May 2024 • Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He
The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess.
1 code implementation • 22 May 2024 • Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He
In recent years, Transformers have achieved remarkable progress in computer vision tasks.
no code implementations • 9 Apr 2024 • Zhida Zhang, Jie Cao, Wenkui Yang, Qihang Fan, Kai Zhou, Ran He
Transformer networks are widely used in face forgery detection due to their scalability across large datasets. Despite this success, transformers struggle to balance capturing global context, which is crucial for unveiling forgery clues, against computational complexity. To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to efficiently process extensive visual contexts while avoiding catastrophic forgetting. Our approach empowers the target token to perceive global information by assigning differential attention levels to tokens at varying distances.
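The excerpt describes assigning attention levels that decay with token distance, in the spirit of RetNet's retention decay. A minimal numpy sketch of that idea, assuming a simple exponential decay by token index (function names and the renormalization step are illustrative, not BAR-Net's actual band-attention code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decay_masked_attention(q, k, v, gamma=0.9):
    """Attention where token j's weight for query i is scaled by
    gamma ** |i - j|: the target token still sees every token
    (global context), but distant ones contribute less."""
    n, d = q.shape
    idx = np.arange(n)
    decay = gamma ** np.abs(idx[:, None] - idx[None, :])  # D[i, j]
    weights = softmax(q @ k.T / np.sqrt(d)) * decay
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize rows
    return weights @ v
```

With `gamma = 1.0` this reduces to vanilla attention; smaller `gamma` concentrates weight on nearby tokens while keeping the global receptive field.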
no code implementations • 27 Mar 2024 • Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration.
no code implementations • 8 Oct 2023 • Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task.
1 code implementation • 8 Oct 2023 • Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo Huang, Ran He, Hongxia Yang
Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks.
1 code implementation • CVPR 2024 • Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes.
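For a 2D vision backbone, the RetNet-style decay above becomes a spatial prior over the token grid. A hedged sketch of one natural choice, a Manhattan-distance decay mask for a flattened H×W grid (an assumption about the form of the prior, not RMT's actual implementation):

```python
import numpy as np

def manhattan_decay_mask(h, w, gamma=0.9):
    """Decay mask for h*w flattened grid tokens: entry (i, j) falls off
    exponentially with the Manhattan distance between the 2D positions
    of tokens i and j, giving attention an explicit spatial prior."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (h*w, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    return gamma ** dist
```

The mask would multiply (or add, in log space) the attention scores, so nearby patches dominate while distant ones are down-weighted rather than cut off.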
1 code implementation • NeurIPS 2023 • Qihang Fan, Huaibo Huang, Xiaoqiang Zhou, Ran He
This paper proposes a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers that models local information, global information, and the bidirectional interaction between them in context-aware ways.
1 code implementation • 31 Mar 2023 • Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He
In CloFormer, combining AttnConv with vanilla attention that uses pooling to reduce FLOPs enables the model to perceive both high-frequency and low-frequency information.
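The low-frequency branch mentioned here reduces FLOPs by pooling before attention. A minimal numpy sketch of that general technique, average-pooling the keys and values along the token axis so the score matrix shrinks from n×n to n×(n/pool) (a sketch of the idea, not CloFormer's code; names are illustrative):

```python
import numpy as np

def pooled_attention(q, k, v, pool=2):
    """Vanilla single-head attention with average-pooled keys/values:
    queries stay at full resolution, but K and V are pooled by `pool`,
    cutting the attention FLOPs roughly by that factor."""
    n, d = k.shape
    m = n // pool
    k_p = k[: m * pool].reshape(m, pool, d).mean(axis=1)  # (m, d)
    v_p = v[: m * pool].reshape(m, pool, d).mean(axis=1)  # (m, d)
    scores = q @ k_p.T / np.sqrt(d)                       # (n, m)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ v_p                                        # (n, d)
```

Pooling discards fine detail, which is why such a branch is typically paired with a local, convolution-based branch (here, AttnConv) that recovers the high-frequency information.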
Ranked #579 on Image Classification on ImageNet