Vision Transformer Adapter for Dense Predictions

17 May 2022 · Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent visual transformers that build vision-specific inductive biases into their architectures, the plain ViT underperforms on dense prediction tasks because it lacks prior knowledge of images. To solve this issue, we propose the Vision Transformer Adapter (ViT-Adapter), which remedies the defects of ViT and achieves performance comparable to vision-specific models by introducing inductive biases through an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter injects the prior information of the data and tasks into the model, making it suitable for these tasks. We verify the effectiveness of ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, our ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope that the proposed ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
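To make the adapter pattern described in the abstract concrete, here is a minimal PyTorch sketch: the plain ViT backbone is left untouched, while a small convolutional branch supplies image priors whose tokens are injected into the ViT tokens during fine-tuning. This is an illustrative sketch under our own assumptions, not the authors' implementation; the module names (SpatialPriorStem, InjectionBlock), the conv-stem layout, and the plain cross-attention fusion are simplifications of the more elaborate, multi-scale interaction scheme in the official repository linked above.

```python
# Illustrative sketch only (assumed module names, simplified fusion);
# see https://github.com/czczup/ViT-Adapter for the authors' implementation.
import torch
import torch.nn as nn


class SpatialPriorStem(nn.Module):
    """Hypothetical conv stem that supplies image priors (locality, scale)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=8, padding=1),  # 1/16 scale
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> prior tokens: (B, H/16 * W/16, C)
        feats = self.stem(images)
        return feats.flatten(2).transpose(1, 2)


class InjectionBlock(nn.Module):
    """Hypothetical fusion step: ViT tokens cross-attend to the prior tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-init: starts as identity

    def forward(self, vit_tokens: torch.Tensor, prior_tokens: torch.Tensor) -> torch.Tensor:
        injected, _ = self.attn(vit_tokens, prior_tokens, prior_tokens)
        return vit_tokens + self.gamma * injected


if __name__ == "__main__":
    B, H, W, dim = 2, 224, 224, 768
    images = torch.randn(B, 3, H, W)
    vit_tokens = torch.randn(B, (H // 16) * (W // 16), dim)  # stand-in for ViT patch tokens

    prior_tokens = SpatialPriorStem(dim)(images)
    fused = InjectionBlock(dim)(vit_tokens, prior_tokens)
    print(fused.shape)  # torch.Size([2, 196, 768])
```

In a full pipeline, the fused tokens would be reshaped into multi-scale feature maps and passed to a dense prediction head such as Mask2Former or HTC++, as in the results reported below.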


Results from the Paper


 Ranked #1 on Semantic Segmentation on Cityscapes test (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | ViT-Adapter-L (Mask2Former, BEiT pretrain) | Validation mIoU | 60.5 | #2 |
| Semantic Segmentation | ADE20K | ViT-Adapter-L (UperNet, BEiT pretrain) | Validation mIoU | 58.4 | #5 |
| Semantic Segmentation | ADE20K val | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU | 60.5 | #2 |
| Semantic Segmentation | ADE20K val | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU | 58.4 | #4 |
| Semantic Segmentation | Cityscapes test | ViT-Adapter-L (Mask2Former, BEiT pretrain, Mapillary) | Mean IoU (class) | 85.2% | #1 |
| Semantic Segmentation | Cityscapes val | ViT-Adapter-L (Mask2Former, BEiT pretrain, Mapillary) | mIoU | 85.8 | #3 |
| Instance Segmentation | COCO minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP | 51.7 | #6 |
| Object Detection | COCO minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box AP | 59.8 | #9 |
| Semantic Segmentation | COCO-Stuff test | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU | 54.2% | #1 |
| Semantic Segmentation | COCO-Stuff test | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU | 51.4% | #2 |
| Instance Segmentation | COCO test-dev | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP | 52.1 | #5 |
| Object Detection | COCO test-dev | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box AP | 60.1 | #9 |
| Semantic Segmentation | PASCAL Context | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU | 67.5 | #2 |
| Semantic Segmentation | PASCAL Context | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU | 68.2 | #1 |
