Vision Transformer Adapter for Dense Predictions

17 May 2022 · Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recent variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense prediction tasks because of its weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to that of vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields a state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
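
The abstract describes the overall recipe (a plain, unchanged ViT backbone plus a randomly initialized, "pre-training-free" adapter that injects image-related inductive biases) without spelling out the mechanism. Below is a minimal, illustrative PyTorch sketch of that recipe, assuming a hypothetical convolutional spatial-prior branch and cross-attention injection; the class names SpatialPriorModule, InjectorBlock, and ViTAdapterSketch are invented for illustration and are not the official API. The actual ViT-Adapter, which uses multi-scale deformable attention and a paired injector/extractor design, is available at the linked repository.

```python
# Illustrative sketch only, not the official ViT-Adapter implementation.
import torch
import torch.nn as nn


class SpatialPriorModule(nn.Module):
    """Hypothetical convolutional stem producing spatial prior tokens."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, embed_dim // 2, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=4, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)                      # (B, C, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)   # (B, N_spatial, C)


class InjectorBlock(nn.Module):
    """Hypothetical injector: ViT tokens cross-attend to the spatial prior tokens."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(embed_dim)
        self.norm_kv = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(embed_dim))  # zero-init gate: starts as identity

    def forward(self, vit_tokens, spatial_tokens):
        q = self.norm_q(vit_tokens)
        kv = self.norm_kv(spatial_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)
        return vit_tokens + self.gamma * attn_out


class ViTAdapterSketch(nn.Module):
    """Plain ViT blocks interleaved with adapter injections (illustrative only)."""

    def __init__(self, vit_blocks: nn.ModuleList, embed_dim: int, inject_every: int = 4):
        super().__init__()
        self.vit_blocks = vit_blocks              # pretrained plain ViT blocks, left unchanged
        self.spm = SpatialPriorModule(embed_dim)  # adapter parts are randomly initialized
        self.injectors = nn.ModuleList(
            InjectorBlock(embed_dim) for _ in range(0, len(vit_blocks), inject_every)
        )
        self.inject_every = inject_every

    def forward(self, image, vit_tokens):
        spatial_tokens = self.spm(image)
        for i, blk in enumerate(self.vit_blocks):
            if i % self.inject_every == 0:
                vit_tokens = self.injectors[i // self.inject_every](vit_tokens, spatial_tokens)
            vit_tokens = blk(vit_tokens)
        return vit_tokens


if __name__ == "__main__":
    embed_dim, num_blocks = 768, 12
    # Stand-in for pretrained plain ViT blocks.
    vit_blocks = nn.ModuleList(
        nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
        for _ in range(num_blocks)
    )
    model = ViTAdapterSketch(vit_blocks, embed_dim)
    image = torch.randn(2, 3, 224, 224)
    tokens = torch.randn(2, 196, embed_dim)  # 14x14 patch tokens from the ViT patch embedding
    print(model(image, tokens).shape)        # torch.Size([2, 196, 768])
```

Initializing the gating parameter at zero makes each injection start as an identity mapping, so the pretrained ViT's behavior is undisturbed at the start of fine-tuning; zero-init gating is a common trick when attaching new modules to pretrained backbones.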

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | ViT-Adapter-L (UperNet, BEiT pretrain) | Validation mIoU | 58.4 | #14 |
| | | | Params (M) | 451 | #11 |
| Semantic Segmentation | ADE20K | ViT-Adapter-L (Mask2Former, BEiT pretrain) | Validation mIoU | 60.5 | #10 |
| | | | Params (M) | 571 | #8 |
| Semantic Segmentation | ADE20K | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | Validation mIoU | 61.5 | #6 |
| | | | Params (M) | 571 | #8 |
| Semantic Segmentation | ADE20K val | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU | 60.5 | #7 |
| Semantic Segmentation | ADE20K val | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU | 58.4 | #9 |
| Semantic Segmentation | Cityscapes test | ViT-Adapter-L (Mask2Former, BEiT pretrain) | Mean IoU (class) | 85.2% | #6 |
| Instance Segmentation | COCO minival | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | mask AP | 54.2 | #5 |
| Panoptic Segmentation | COCO minival | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) | PQ | 58.4 | #5 |
| | | | PQth | 65.0 | #2 |
| | | | PQst | 48.4 | #6 |
| | | | AP | 48.9 | #7 |
| Instance Segmentation | COCO minival | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | mask AP | 52.5 | #11 |
| Instance Segmentation | COCO minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP | 52.2 | #13 |
| Object Detection | COCO minival | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | box AP | 60.5 | #20 |
| Object Detection | COCO minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box AP | 60.2 | #24 |
| Object Detection | COCO-O | ViT-Adapter (BEiTv2-L) | Average mAP | 34.25 | #11 |
| | | | Effective Robustness | 7.79 | #14 |
| Instance Segmentation | COCO test-dev | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | mask AP | 54.5 | #6 |
| Object Detection | COCO test-dev | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box mAP | 60.4 | #26 |
| Instance Segmentation | COCO test-dev | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | mask AP | 53.0 | #10 |
| Object Detection | COCO test-dev | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | box mAP | 60.9 | #23 |
| Instance Segmentation | COCO test-dev | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP | 52.5 | #13 |
| Semantic Segmentation | PASCAL Context | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU | 67.5 | #5 |
| Semantic Segmentation | PASCAL Context | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU | 68.2 | #4 |
