ViT-NeT: Interpretable Vision Transformers with Neural Tree Decoder

Vision transformers (ViTs), which have demonstrated state-of-the-art performance in image classification, can also visualize global interpretations through attention-based contributions. However, the complexity of the model makes it difficult to interpret the decision-making process, and the ambiguity of the attention maps can cause incorrect correlations between image patches. In this study, we propose a new ViT neural tree decoder (ViT-NeT). A ViT acts as the backbone, and to address its limitations, its output contextual image patches are fed to the proposed NeT. The NeT aims to accurately classify fine-grained objects with similar inter-class correlations and different intra-class correlations. In addition, it describes the decision-making process through a tree structure and prototypes and enables a visual interpretation of the results. The proposed ViT-NeT is designed to not only improve the classification performance but also provide a human-friendly interpretation, which is effective in resolving the trade-off between performance and interpretability. We compared the performance of ViT-NeT with other state-of-the-art methods on widely used fine-grained visual categorization benchmark datasets and experimentally showed that the proposed method is superior in terms of classification performance and interpretability. The code and models are publicly available at https://github.com/jumpsnack/ViT-NeT.
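To make the idea concrete, the sketch below shows one way a prototype-based soft decision tree could be placed on top of a transformer backbone's contextual patch embeddings. This is a simplified illustration under our own assumptions, not the authors' released implementation: the class name SoftNeuralTree, the fixed tree depth, the mean-pooling of patch tokens, and the cosine-similarity routing are all illustrative choices; the official repository linked above differs in detail.

```python
# Minimal sketch (not the authors' code) of a prototype-based soft tree
# decoder over ViT patch features, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftNeuralTree(nn.Module):
    """Soft binary tree of depth `depth`. Each internal node holds a prototype
    vector and routes the image left/right according to the similarity between
    the pooled patch features and that prototype; each leaf holds class logits."""

    def __init__(self, embed_dim: int, num_classes: int, depth: int = 3):
        super().__init__()
        self.depth = depth
        num_internal = 2 ** depth - 1            # routing nodes
        num_leaves = 2 ** depth                  # prediction nodes
        self.prototypes = nn.Parameter(torch.randn(num_internal, embed_dim))
        self.leaf_logits = nn.Parameter(torch.zeros(num_leaves, num_classes))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) contextual patch embeddings from the backbone
        pooled = feats.mean(dim=1)                               # (B, D)

        # Cosine similarity between the image and every internal-node prototype
        pooled_n = F.normalize(pooled, dim=-1)                   # (B, D)
        proto_n = F.normalize(self.prototypes, dim=-1)           # (num_internal, D)
        p_right = torch.sigmoid(pooled_n @ proto_n.t())          # (B, num_internal)

        # Accumulate the probability of reaching each leaf, level by level
        leaf_prob = torch.ones(feats.size(0), 1, device=feats.device)
        for level in range(self.depth):
            start = 2 ** level - 1
            probs = p_right[:, start:start + 2 ** level]         # nodes at this level
            # Each node splits its probability mass into (left, right) children
            leaf_prob = torch.stack(
                [leaf_prob * (1 - probs), leaf_prob * probs], dim=-1
            ).flatten(1)                                         # (B, 2**(level+1))

        # Mixture of leaf classifiers weighted by path probability
        return leaf_prob @ self.leaf_logits                      # (B, num_classes)


if __name__ == "__main__":
    # Stand-in for backbone output: 14x14 patch tokens of a ViT-B-like model
    feats = torch.randn(4, 196, 768)
    tree = SoftNeuralTree(embed_dim=768, num_classes=200, depth=3)
    print(tree(feats).shape)                                     # torch.Size([4, 200])
```

Because the routing is a product of per-node probabilities along a path, the path itself (which prototypes were matched, and how strongly) can be visualized, which is what gives the tree-and-prototype design its interpretability.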

Benchmark results

Task                              | Dataset       | Model                | Metric Name | Metric Value | Global Rank
Fine-Grained Image Classification | CUB-200-2011  | ViT-NeT (SwinV2-B)   | Accuracy    | 91.7%        | #12
Fine-Grained Image Classification | Stanford Cars | ViT-NeT (SwinV2-B)   | Accuracy    | 95.0%        | #19
Fine-Grained Image Classification | Stanford Dogs | ViT-NeT (DeiT-III-B) | Accuracy    | 93.6%        | #2
