Exploring Plain Vision Transformer Backbones for Object Detection

30 Mar 2022 · Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
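To make the "simple feature pyramid" idea concrete, below is a minimal PyTorch sketch of building multi-scale features from the single stride-16 feature map of a plain ViT, using deconvolutions for upsampling and max pooling for downsampling. This is an illustrative reconstruction of the idea, not the Detectron2 implementation; the class name, channel widths, and the 1x1 output projections are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Sketch: build a {4, 8, 16, 32}-stride pyramid from one stride-16 map.

    Upsampling uses transposed convolutions, downsampling uses max pooling,
    following the idea in the paper (no FPN-style lateral connections).
    Illustrative only; hyperparameters here are assumptions.
    """

    def __init__(self, dim: int = 768, out_dim: int = 256):
        super().__init__()
        # stride 16 -> 4: two stride-2 deconvolutions
        self.scale4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one stride-2 deconvolution
        self.scale8 = nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)
        # stride 16 -> 16: identity
        self.scale16 = nn.Identity()
        # stride 16 -> 32: stride-2 max pooling
        self.scale32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # project each level to a common width for the detector head
        self.out_convs = nn.ModuleList(
            nn.Conv2d(c, out_dim, kernel_size=1)
            for c in (dim // 4, dim // 2, dim, dim)
        )

    def forward(self, x: torch.Tensor):
        # x: (B, dim, H/16, W/16), the last feature map of a plain ViT
        feats = [self.scale4(x), self.scale8(x), self.scale16(x), self.scale32(x)]
        return [conv(f) for conv, f in zip(self.out_convs, feats)]

# Example: a 1024x1024 input gives a 64x64 stride-16 ViT feature map.
pyramid = SimpleFeaturePyramid(dim=768, out_dim=256)
levels = pyramid(torch.randn(1, 768, 64, 64))
print([tuple(f.shape) for f in levels])
# [(1, 256, 256, 256), (1, 256, 128, 128), (1, 256, 64, 64), (1, 256, 32, 32)]
```

The second observation in the abstract is handled analogously in the backbone itself: most blocks use non-shifted window attention, and a few blocks (e.g. global-attention or convolutional propagation blocks) exchange information across windows.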

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Instance Segmentation | COCO minival | ViTDet, ViT-H Cascade (multiscale) | mask AP | 53.1 | #8 |
| Object Detection | COCO minival | ViTDet, ViT-H Cascade (multiscale) | box AP | 61.3 | #16 |
| Instance Segmentation | COCO minival | ViTDet, ViT-H Cascade | mask AP | 52.0 | #14 |
| Object Detection | COCO minival | ViTDet, ViT-H Cascade | box AP | 60.4 | #21 |
| Object Detection | COCO-O | ViTDet (ViT-H) | Average mAP | 34.3 | #10 |
| Object Detection | COCO-O | ViTDet (ViT-H) | Effective Robustness | 7.89 | #13 |
| Instance Segmentation | LVIS v1.0 val | ViTDet-L | mask AP | 46.0 | #6 |
| Instance Segmentation | LVIS v1.0 val | ViTDet-L | mask APr | 34.3 | #3 |
| Object Detection | LVIS v1.0 val | ViTDet-H | box AP | 53.4 | #6 |
| Object Detection | LVIS v1.0 val | ViTDet-L | box AP | 51.2 | #7 |
| Instance Segmentation | LVIS v1.0 val | ViTDet-H | mask AP | 48.1 | #5 |
| Instance Segmentation | LVIS v1.0 val | ViTDet-H | mask APr | 36.9 | #2 |
