ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architectures for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt.
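To make the abstract's two components concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: learnable [DET] tokens are appended to the patch tokens and carried through the body by a reconfigured attention step (patch-to-patch self-attention, plus [DET] tokens attending to themselves and to the patches), and a lightweight transformer decoder then refines the [DET] tokens against the multi-scale patch features. Plain full attention stands in for Swin's windowed attention and the paper's deformable attention, token resolution is kept constant across stages, and all names and sizes here (ReconfiguredAttention, ViDTSketch, dim=256, num_det=100) are illustrative assumptions.

import torch
import torch.nn as nn

class ReconfiguredAttention(nn.Module):
    """Hypothetical split-attention step: patch tokens attend only to
    patches; [DET] tokens attend to themselves and cross-attend to patches."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.patch_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.det_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches, det):
        # patch x patch self-attention (windowed in the real Swin body)
        patches, _ = self.patch_attn(patches, patches, patches)
        # [DET] x [DET] self-attention bound with [DET] x patch cross-attention
        kv = torch.cat([det, patches], dim=1)
        det, _ = self.det_attn(det, kv, kv)
        return patches, det

class ViDTSketch(nn.Module):
    def __init__(self, dim=256, num_det=100, num_classes=91):
        super().__init__()
        self.det_tokens = nn.Parameter(torch.randn(1, num_det, dim))
        self.body = nn.ModuleList(ReconfiguredAttention(dim) for _ in range(4))
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.neck = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, patches):
        det = self.det_tokens.expand(patches.size(0), -1, -1)
        multi_scale = []
        for block in self.body:
            patches, det = block(patches, det)
            multi_scale.append(patches)          # one feature map per stage
        memory = torch.cat(multi_scale, dim=1)   # stand-in for multi-scale deformable attention
        det = self.neck(det, memory)             # decoder refines only [DET] tokens
        return self.cls_head(det), self.box_head(det).sigmoid()

# toy forward pass: batch of 2, 196 patch tokens of dimension 256
logits, boxes = ViDTSketch()(torch.randn(2, 196, 256))

The split attention is where the efficiency comes from: patch tokens never attend to the [DET] tokens, so the body's cost stays essentially that of the backbone, while the small set of [DET] tokens gathers detection evidence and only the decoder operates on all of them jointly.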


Datasets

COCO 2017

Results from the Paper


Task: Object Detection. Dataset: COCO 2017 val (global leaderboard rank in parentheses).

Model            AP          AP50        AP75        APS        APM        APL        Param.
ViDT Swin-nano   40.4 (#24)  59.6 (#10)  43.3 (#9)   23.2 (#7)  42.5 (#7)  55.8 (#8)  16M (#22)
ViDT Swin-tiny   44.8 (#20)  64.5 (#9)   48.7 (#8)   25.9 (#6)  47.6 (#6)  62.1 (#7)  38M (#23)
ViDT Swin-small  47.5 (#17)  67.7 (#6)   51.4 (#7)   29.2 (#5)  50.7 (#5)  64.8 (#4)  61M (#25)
ViDT Swin-base   49.2 (#12)  69.4 (#4)   53.1 (#6)   30.6 (#4)  52.6 (#4)  66.9 (#3)  0.1B (#1)
