Optimizing Relevance Maps of Vision Transformers Improves Robustness

2 Jun 2022  ·  Hila Chefer, Idan Schwartz, Lior Wolf

It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks. Specifically, we encourage the model's relevancy map (i) to assign lower relevance to background regions and (ii) to consider as much information as possible from the foreground, and (iii) we encourage the model's decisions to have high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain shifts is observed. Moreover, the foreground masks can be obtained automatically from a self-supervised variant of the ViT model itself; therefore, no additional supervision is required.
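The three objectives named in the abstract can be read as three terms of a single finetuning loss. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `relevance_finetune_loss`, the squared-error form of the background and foreground terms, the use of cross-entropy against the model's own top prediction as the confidence term, and the weights `w_bg`, `w_fg`, `w_conf` are all illustrative choices. The relevance map and mask are assumed to be flattened to equal-length lists, with relevance values in [0, 1] and mask values of 1 for foreground and 0 for background.

```python
import math


def relevance_finetune_loss(relevance, mask, logits,
                            w_bg=1.0, w_fg=1.0, w_conf=1.0):
    """Hypothetical three-term loss mirroring objectives (i)-(iii).

    relevance: flat list of per-patch relevance scores in [0, 1]
    mask:      flat list, 1.0 for foreground patches, 0.0 for background
    logits:    flat list of raw class scores for the same image
    """
    if not relevance or len(relevance) != len(mask):
        raise ValueError("relevance and mask must be non-empty and equal length")
    n = len(relevance)

    # (i) Background term: penalize relevance that falls on background patches.
    bg = sum((r * (1.0 - m)) ** 2 for r, m in zip(relevance, mask)) / n

    # (ii) Foreground term: penalize foreground patches with relevance below 1.
    fg = sum((m * (1.0 - r)) ** 2 for r, m in zip(relevance, mask)) / n

    # (iii) Confidence term: cross-entropy against the model's own top class,
    # computed with a numerically stable log-sum-exp.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    conf = log_norm - top  # equals -log softmax(logits)[argmax]

    return w_bg * bg + w_fg * fg + w_conf * conf


# Relevance that matches the mask, with a confident prediction, costs little;
# relevance concentrated on the background is penalized by terms (i) and (ii).
aligned = relevance_finetune_loss([1.0, 0.0], [1.0, 0.0], [100.0, 0.0, 0.0])
inverted = relevance_finetune_loss([0.0, 1.0], [1.0, 0.0], [100.0, 0.0, 0.0])
print(aligned < inverted)
```

In a real finetuning setup the relevance map would have to be produced by a differentiable explainability method so that gradients flow back into the model; this sketch only shows how the three terms combine.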

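| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Out-of-Distribution Generalization | ImageNet-W | RobustViT | IN-W Gap | -7.3 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | RobustViT | Carton Gap | +34 | #1 |
| Image Classification | ObjectNet | AR-L (Opt Relevance) | Top-5 Accuracy | 73.5 | #3 |
| Image Classification | ObjectNet | AR-L (Opt Relevance) | Top-1 Accuracy | 52.0 | #24 |
| Image Classification | ObjectNet | ViT-L | Top-5 Accuracy | 59.5 | #12 |
| Image Classification | ObjectNet | ViT-L | Top-1 Accuracy | 37.4 | #48 |
| Image Classification | ObjectNet | AR-S (Opt Relevance) | Top-5 Accuracy | 61.7 | #10 |
| Image Classification | ObjectNet | AR-S (Opt Relevance) | Top-1 Accuracy | 39.3 | #45 |
| Image Classification | ObjectNet | AR-B | Top-5 Accuracy | 63.7 | #9 |
| Image Classification | ObjectNet | AR-B | Top-1 Accuracy | 41.4 | #42 |
| Image Classification | ObjectNet | ViT-B (Opt Relevance) | Top-5 Accuracy | 65.1 | #8 |
| Image Classification | ObjectNet | ViT-B (Opt Relevance) | Top-1 Accuracy | 42.2 | #40 |
| Image Classification | ObjectNet | DeiT-L (Opt Relevance) | Top-5 Accuracy | 56.6 | #14 |
| Image Classification | ObjectNet | DeiT-L (Opt Relevance) | Top-1 Accuracy | 36.3 | #49 |
| Image Classification | ObjectNet | DeiT-L | Top-5 Accuracy | 48.5 | #26 |
| Image Classification | ObjectNet | DeiT-L | Top-1 Accuracy | 31.4 | #62 |
| Image Classification | ObjectNet | DeiT-S (Opt Relevance) | Top-5 Accuracy | 53 | #19 |
| Image Classification | ObjectNet | DeiT-S (Opt Relevance) | Top-1 Accuracy | 31.6 | #60 |
| Image Classification | ObjectNet | DeiT-S | Top-5 Accuracy | 47.3 | #28 |
| Image Classification | ObjectNet | DeiT-S | Top-1 Accuracy | 28.3 | #76 |
| Image Classification | ObjectNet | AR-L | Top-5 Accuracy | 68.3 | #6 |
| Image Classification | ObjectNet | AR-L | Top-1 Accuracy | 46.5 | #36 |
| Image Classification | ObjectNet | AR-B (Opt Relevance) | Top-5 Accuracy | 70 | #4 |
| Image Classification | ObjectNet | AR-B (Opt Relevance) | Top-1 Accuracy | 47.1 | #32 |
| Image Classification | ObjectNet | AR-S | Top-5 Accuracy | 55.8 | #17 |
| Image Classification | ObjectNet | AR-S | Top-1 Accuracy | 34.3 | #56 |
| Image Classification | ObjectNet | ViT-L (Opt Relevance) | Top-5 Accuracy | 65.8 | #7 |
| Image Classification | ObjectNet | ViT-L (Opt Relevance) | Top-1 Accuracy | 43.2 | #37 |
| Image Classification | ObjectNet | ViT-B | Top-5 Accuracy | 56.4 | #15 |
| Image Classification | ObjectNet | ViT-B | Top-1 Accuracy | 35.1 | #54 |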
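
Because the baseline and "Opt Relevance" entries for each model appear in separate rows, the per-model change is easier to see when the pairs are placed side by side. The short sketch below does that arithmetic for ObjectNet Top-1 accuracy. The numbers are copied from the results listed above; the names `top1` and `gains` are illustrative only.

```python
# ObjectNet Top-1 accuracy per model: (baseline, with Opt Relevance).
top1 = {
    "ViT-B": (35.1, 42.2),
    "ViT-L": (37.4, 43.2),
    "AR-S": (34.3, 39.3),
    "AR-B": (41.4, 47.1),
    "AR-L": (46.5, 52.0),
    "DeiT-S": (28.3, 31.6),
    "DeiT-L": (31.4, 36.3),
}

# Absolute gain in percentage points, rounded to the table's precision.
gains = {
    model: round(after - before, 1)
    for model, (before, after) in top1.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain} points")
```

Every listed model improves in Top-1 accuracy after the finetuning step, with gains between 3.3 and 7.1 percentage points.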

Methods