no code implementations • 24 Mar 2022 • Zi-Chao Zhang, Zhen-Duo Chen, Yongxin Wang, Xin Luo, Xin-Shun Xu
Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks. However, there are some limitations when applying ViT directly to FGVC. First, ViT needs to split images into patches and calculate the attention of every pair, which may result in heavy redundant calculation and unsatisfying performance when handling fine-grained images with complex background and small objects. Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information.
no code implementations • 1 Feb 2024 • Li-Jun Zhao, Zhen-Duo Chen, Zi-Chao Zhang, Xin Luo, Xin-Shun Xu
Some recent methods somewhat alleviate the accuracy imbalance between base and incremental classes by fine-tuning the feature extractor in the incremental sessions, but they further cause the accuracy imbalance between past and current incremental classes.