Visual Prompt Tuning

The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters (less than 1% of model parameters) in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
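The core idea, prepending a small set of trainable prompt tokens to the patch-token sequence while the Transformer backbone stays frozen, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the names (`make_prompts`, `vpt_shallow_input`), the embedding dimension, and the prompt count are all assumptions chosen for clarity.

```python
import random

EMBED_DIM = 8      # hypothetical embedding dimension (assumption, ViT-B uses 768)
NUM_PROMPTS = 5    # hypothetical number of trainable prompt tokens

def make_prompts(num_prompts, dim, seed=0):
    """Randomly initialised prompt tokens; in VPT these are the only
    backbone-side parameters that receive gradient updates."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_prompts)]

def vpt_shallow_input(cls_token, prompts, patch_tokens):
    """Build the encoder input sequence [CLS; prompts; patches].
    In VPT-Shallow, prompts are inserted only before the first layer;
    VPT-Deep would add a fresh set of prompts at every layer's input."""
    return [cls_token] + prompts + patch_tokens

# Toy frozen inputs: a CLS token and 196 patch embeddings (14x14 grid)
cls = [0.0] * EMBED_DIM
patches = [[0.0] * EMBED_DIM for _ in range(196)]
prompts = make_prompts(NUM_PROMPTS, EMBED_DIM)

seq = vpt_shallow_input(cls, prompts, patches)
print(len(seq))  # 1 + NUM_PROMPTS + 196 = 202
```

The frozen backbone then processes this extended sequence exactly as it would a plain ViT input; only the prompt vectors (and a task head) are optimised, which is what keeps the per-task storage cost small.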


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Long-tail Learning | CIFAR-100-LT (ρ=10) | VPT | Error Rate | 10.4 | # 3 |
| Long-tail Learning | CIFAR-100-LT (ρ=100) | VPT | Error Rate | 19 | # 4 |
| Long-tail Learning | CIFAR-100-LT (ρ=50) | VPT | Error Rate | 15.2 | # 3 |
| Visual Prompt Tuning | FGVC | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 79.26 | # 6 |
| Visual Prompt Tuning | FGVC | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 72.02 | # 9 |
| Visual Prompt Tuning | FGVC | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 83.12 | # 4 |
| Visual Prompt Tuning | FGVC | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 57.84 | # 10 |
| Prompt Engineering | ImageNet-21k | VPT | Accuracy | 24.8 | # 2 |
| Visual Prompt Tuning | VTAB-1k (Natural<7>) | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 67.34 | # 5 |
| Visual Prompt Tuning | VTAB-1k (Natural<7>) | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 70.27 | # 4 |
| Visual Prompt Tuning | VTAB-1k (Natural<7>) | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 36.02 | # 10 |
| Visual Prompt Tuning | VTAB-1k (Natural<7>) | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 39.96 | # 9 |
| Visual Prompt Tuning | VTAB-1k (Specialized<4>) | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 83.04 | # 5 |
| Visual Prompt Tuning | VTAB-1k (Specialized<4>) | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 69.65 | # 9 |
| Visual Prompt Tuning | VTAB-1k (Specialized<4>) | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 60.61 | # 10 |
| Visual Prompt Tuning | VTAB-1k (Specialized<4>) | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 82.26 | # 6 |
| Visual Prompt Tuning | VTAB-1k (Structured<8>) | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 37.55 | # 7 |
| Visual Prompt Tuning | VTAB-1k (Structured<8>) | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 27.50 | # 9 |
| Visual Prompt Tuning | VTAB-1k (Structured<8>) | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy | 26.57 | # 10 |
| Visual Prompt Tuning | VTAB-1k (Structured<8>) | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy | 42.38 | # 6 |
