Understanding The Robustness in Vision Transformers

26 Apr 2022 · Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
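The core idea behind the attentional channel processing is to apply self-attention over the channel dimension rather than the token dimension, so that feature channels are reweighted by their mutual affinities. As a rough illustration only, here is a minimal NumPy sketch of channel-wise self-attention; the function name `channel_attention` and the unprojected dot-product formulation are assumptions for exposition, not the exact FAN module (which the paper defines with learned projections and additional design details).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x):
    """Toy channel-wise self-attention (illustrative, not the FAN module).

    x: (n_tokens, n_channels) feature matrix. Each channel's response
    vector over all tokens acts as one attention "token", so the
    (D x D) affinity matrix mixes channels instead of spatial tokens.
    """
    xt = x.T                                    # (D, N): channels as rows
    scores = xt @ xt.T / np.sqrt(xt.shape[1])   # (D, D) channel affinities
    attn = softmax(scores, axis=-1)             # each channel attends to all channels
    return (attn @ xt).T                        # mix channels, back to (N, D)

# Usage: reweight an 16-token, 8-channel feature map.
x = np.random.default_rng(0).standard_normal((16, 8))
y = channel_attention(x)
```

In contrast, standard token self-attention would build an (N x N) affinity matrix over spatial positions; swapping the dimension being attended over is what makes the channel processing "attentional".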


Results from the Paper


Ranked #4 on Domain Generalization on ImageNet-R (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | FAN-L-Hybrid | mIoU | 82.3 | #32 | |
| Object Detection | COCO minival | FAN-L-Hybrid | box AP | 55.1 | #49 | |
| Semantic Segmentation | DensePASS | FAN (MiT-B1) | mIoU | 42.54% | #11 | |
| Image Classification | ImageNet | FAN-L-Hybrid++ | Top-1 Accuracy | 87.1% | #103 | |
| | | | Number of params | 76.8M | #802 | |
| Domain Generalization | ImageNet-A | FAN-Hybrid-L (IN-21K, 384) | Top-1 accuracy % | 74.5 | #7 | |
| Domain Generalization | ImageNet-C | FAN-B-Hybrid (IN-22k) | mean Corruption Error (mCE) | 41.0 | #14 | |
| | | | Top-1 Accuracy | 70.5 | #2 | |
| | | | Number of params | 50M | #30 | |
| Domain Generalization | ImageNet-C | FAN-L-Hybrid | mean Corruption Error (mCE) | 43.0 | #20 | |
| | | | Top-1 Accuracy | 67.7 | #4 | |
| | | | Number of params | 77M | #31 | |
| Domain Generalization | ImageNet-C | FAN-L-Hybrid (IN-22k) | mean Corruption Error (mCE) | 35.8 | #8 | |
| | | | Top-1 Accuracy | 73.6 | #1 | |
| | | | Number of params | 77M | #31 | |
| Domain Generalization | ImageNet-R | FAN-Hybrid-L (IN-21K, 384) | Top-1 Error Rate | 28.9 | #4 | ✓ |

Methods


No methods listed for this paper.