ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architectures for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt.
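To make the abstract's two components concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: learnable [DET] tokens are appended to the patch tokens and carried through the body by a reconfigured attention step (patch-to-patch self-attention, plus [DET] tokens attending to themselves and to the patches), and a lightweight transformer decoder then refines the [DET] tokens against the multi-scale patch features. Plain full attention stands in for Swin's windowed attention and the paper's deformable attention, token resolution is kept constant across stages, and all names and sizes here (ReconfiguredAttention, ViDTSketch, dim=256, num_det=100) are illustrative assumptions.

import torch
import torch.nn as nn

class ReconfiguredAttention(nn.Module):
    """Hypothetical split-attention step: patch tokens attend only to
    patches; [DET] tokens attend to themselves and cross-attend to patches."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.patch_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.det_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches, det):
        # patch x patch self-attention (windowed in the real Swin body)
        patches, _ = self.patch_attn(patches, patches, patches)
        # [DET] x [DET] self-attention bound with [DET] x patch cross-attention
        kv = torch.cat([det, patches], dim=1)
        det, _ = self.det_attn(det, kv, kv)
        return patches, det

class ViDTSketch(nn.Module):
    def __init__(self, dim=256, num_det=100, num_classes=91):
        super().__init__()
        self.det_tokens = nn.Parameter(torch.randn(1, num_det, dim))
        self.body = nn.ModuleList(ReconfiguredAttention(dim) for _ in range(4))
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.neck = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, patches):
        det = self.det_tokens.expand(patches.size(0), -1, -1)
        multi_scale = []
        for block in self.body:
            patches, det = block(patches, det)
            multi_scale.append(patches)          # one feature map per stage
        memory = torch.cat(multi_scale, dim=1)   # stand-in for multi-scale deformable attention
        det = self.neck(det, memory)             # decoder refines only [DET] tokens
        return self.cls_head(det), self.box_head(det).sigmoid()

# toy forward pass: batch of 2, 196 patch tokens of dimension 256
logits, boxes = ViDTSketch()(torch.randn(2, 196, 256))

The split attention is where the efficiency comes from: patch tokens never attend to the [DET] tokens, so the body's cost stays essentially that of the backbone, while the small set of [DET] tokens gathers detection evidence and only the decoder operates on all of them jointly.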


Datasets

COCO 2017

Results from the Paper


Task: Object Detection. Dataset: COCO 2017 val (global leaderboard rank in parentheses).

Model            AP          AP50        AP75        APS        APM        APL        Param.
ViDT Swin-nano   40.4 (#24)  59.6 (#10)  43.3 (#9)   23.2 (#7)  42.5 (#7)  55.8 (#8)  16M (#22)
ViDT Swin-tiny   44.8 (#20)  64.5 (#9)   48.7 (#8)   25.9 (#6)  47.6 (#6)  62.1 (#7)  38M (#23)
ViDT Swin-small  47.5 (#17)  67.7 (#6)   51.4 (#7)   29.2 (#5)  50.7 (#5)  64.8 (#4)  61M (#25)
ViDT Swin-base   49.2 (#12)  69.4 (#4)   53.1 (#6)   30.6 (#4)  52.6 (#4)  66.9 (#3)  0.1B (#1)
