Context-Aware Robust Fine-Tuning

29 Nov 2022  ·  Xiaofeng Mao, Yuefeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, Zhao Li

Contrastive Language-Image Pre-trained (CLIP) models can classify an image as belonging to "[CLASS]" in a zero-shot manner by measuring the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Thanks to the exhaustive text cues in "[CONTEXT]", a CLIP model is aware of different contexts, e.g., background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent work finds that further fine-tuning CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning so that it keeps capturing context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets downstream tasks inherit the context-aware ability of CLIP and achieves higher accuracy both In-Distribution (ID) and Out-Of-Distribution (OOD). The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, setting a new state of the art.
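
The regularizer described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature of 0.01, the weight `alpha`, the direction of the KL term (original to fine-tuned), and the choice to add the term to a cross-entropy loss are all assumptions made for illustration. `context_weights` stands in for the zero-shot prompt weights, i.e. one text embedding per context prompt.

```python
import numpy as np


def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def context_distribution(image_features, context_weights, temperature=0.01):
    """Distribution over contexts for each image.

    image_features:  (batch, dim) image embeddings.
    context_weights: (num_contexts, dim) text embeddings of context prompts
                     (hypothetical stand-in for the zero-shot prompt weights).
    temperature:     assumed value, not taken from the paper.
    """
    img = l2_normalize(image_features)
    ctx = l2_normalize(context_weights)
    logits = img @ ctx.T / temperature  # scaled cosine similarities
    return softmax(logits, axis=-1)


def context_kl(p_orig, p_ft, eps=1e-12):
    # Batch mean of KL(p_orig || p_ft); the direction is an assumption.
    per_sample = np.sum(p_orig * (np.log(p_orig + eps) - np.log(p_ft + eps)), axis=-1)
    return float(np.mean(per_sample))


def car_ft_style_loss(class_logits, labels, feats_orig, feats_ft,
                      context_weights, alpha=1.0, temperature=0.01):
    """Cross-entropy plus a weighted context-KL term (assumed combination)."""
    log_probs = np.log(softmax(class_logits, axis=-1) + 1e-12)
    ce = -float(np.mean(log_probs[np.arange(len(labels)), labels]))
    p_orig = context_distribution(feats_orig, context_weights, temperature)
    p_ft = context_distribution(feats_ft, context_weights, temperature)
    return ce + alpha * context_kl(p_orig, p_ft)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_orig = rng.normal(size=(4, 8))                      # frozen CLIP features (toy)
    feats_ft = feats_orig + 0.1 * rng.normal(size=(4, 8))     # drifted fine-tuned features (toy)
    contexts = rng.normal(size=(5, 8))                        # toy context prompt embeddings
    logits = rng.normal(size=(4, 3))
    labels = np.array([0, 1, 2, 0])
    print(car_ft_style_loss(logits, labels, feats_orig, feats_ft, contexts))
```

In this sketch the KL term is zero when the fine-tuned features reproduce the original context distribution and grows as they drift, which is the behavior the abstract attributes to the regularizer. In an actual training loop the original model's features would be computed without gradients.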


Results from the Paper

Task                   Dataset          Model                          Metric Name       Metric Value  Global Rank
Domain Generalization  DomainNet        CAR-FT (CLIP, ViT-B/16)        Average Accuracy  62.5          # 2
Domain Generalization  ImageNet-A       CAR-FT (CLIP, ViT-L/14@336px)  Top-1 accuracy %  81.5          # 4
Domain Generalization  ImageNet-R       CAR-FT (CLIP, ViT-L/14@336px)  Top-1 Error Rate  10.3          # 3
Domain Generalization  ImageNet-Sketch  CAR-FT (CLIP, ViT-L/14@336px)  Top-1 accuracy    65.5          # 3
Domain Generalization  Office-Home      CAR-FT (CLIP, ViT-B/16)        Average Accuracy  85.7          # 3
Domain Generalization  PACS             CAR-FT (CLIP, ViT-B/16)        Average Accuracy  96.8          # 4
Domain Generalization  TerraIncognita   CAR-FT (CLIP, ViT-B/16)        Average Accuracy  61.9          # 2
Domain Generalization  VLCS             CAR-FT (CLIP, ViT-B/16)        Average Accuracy  85.5          # 1