Exploring Plain Vision Transformer Backbones for Object Detection

30 Mar 2022 · Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
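
The two observations above can be made concrete with a minimal pure-Python sketch. It is illustrative only and is not the Detectron2 implementation: the function names, the 1024x1024 input, the patch size of 16, and the scale factors (4, 2, 1, 1/2) are assumptions chosen for the example, not values stated in the abstract. The first function lists the spatial sizes of a pyramid derived from one single-scale feature map; the second groups patch tokens into fixed, non-shifted windows, inside which attention would be computed.

```python
from fractions import Fraction


def simple_pyramid_shapes(image_hw, patch_size=16,
                          scale_factors=(4, 2, 1, Fraction(1, 2))):
    """Return (stride, height, width) for each pyramid level.

    Every level is derived from the SAME single-scale feature map
    (stride == patch_size); no top-down or lateral FPN paths are modeled.
    The default scale factors are an assumption made for illustration.
    """
    h, w = image_hw
    base_h, base_w = h // patch_size, w // patch_size  # single-scale ViT map
    levels = []
    for s in scale_factors:
        stride = Fraction(patch_size) / s
        levels.append((int(stride), int(base_h * s), int(base_w * s)))
    return levels


def window_partition(grid_h, grid_w, window):
    """Group flat token indices into non-overlapping, non-shifted windows.

    Attention restricted to each returned group is "window attention";
    a block that attends over all tokens at once would provide the
    cross-window propagation mentioned in the abstract.
    """
    if grid_h % window or grid_w % window:
        raise ValueError("grid must be divisible by the window size (pad first)")
    windows = []
    for wy in range(0, grid_h, window):
        for wx in range(0, grid_w, window):
            windows.append([y * grid_w + x
                            for y in range(wy, wy + window)
                            for x in range(wx, wx + window)])
    return windows


if __name__ == "__main__":
    for stride, lh, lw in simple_pyramid_shapes((1024, 1024)):
        print(f"stride {stride:>2}: {lh} x {lw}")
    print(window_partition(4, 4, 2)[0])
```

Under these assumptions, a 1024x1024 input yields a 64x64 token grid, and the pyramid levels are rescaled copies of that one map. The windows never move between blocks, so information crosses window boundaries only in the few blocks that attend globally.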

Results from the Paper


Task                                    Dataset        Model                               Metric Name           Metric Value  Global Rank
Cross-Domain Few-Shot Object Detection  Artaxor        ViTDeT-FT                           mAP                   23.4          #5
Cross-Domain Few-Shot Object Detection  Clipark1k      ViTDeT-FT                           mAP                   25.6          #1
Object Detection                        COCO minival   ViTDet, ViT-H Cascade               box AP                60.4          #21
Instance Segmentation                   COCO minival   ViTDet, ViT-H Cascade (multiscale)  mask AP               53.1          #10
Object Detection                        COCO minival   ViTDet, ViT-H Cascade (multiscale)  box AP                61.3          #16
Instance Segmentation                   COCO minival   ViTDet, ViT-H Cascade               mask AP               52            #16
Object Detection                        COCO-O         ViTDet (ViT-H)                      Average mAP           34.3          #10
Object Detection                        COCO-O         ViTDet (ViT-H)                      Effective Robustness  7.89          #13
Cross-Domain Few-Shot Object Detection  DeepFish       ViTDeT-FT                           mAP                   6.5           #4
Cross-Domain Few-Shot Object Detection  DIOR           ViTDeT-FT                           mAP                   29.4          #3
Object Detection                        LVIS v1.0 val  ViTDet-L                            box AP                51.2          #8
Instance Segmentation                   LVIS v1.0 val  ViTDet-L                            mask AP               46.0          #6
Instance Segmentation                   LVIS v1.0 val  ViTDet-L                            mask APr              34.3          #4
Object Detection                        LVIS v1.0 val  ViTDet-H                            box AP                53.4          #7
Instance Segmentation                   LVIS v1.0 val  ViTDet-H                            mask AP               48.1          #5
Instance Segmentation                   LVIS v1.0 val  ViTDet-H                            mask APr              36.9          #3
Cross-Domain Few-Shot Object Detection  NEU-DET        ViTDeT-FT                           mAP                   15.8          #2
Cross-Domain Few-Shot Object Detection  UODD           ViTDeT-FT                           mAP                   15.8          #4

Methods