TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Instance Segmentation	OVIS validation	DeVIS (Swin-L)	mask AP	35.5	# 22
Video Instance Segmentation	OVIS validation	DeVIS (Swin-L)	AP50	59.3	# 21
Video Instance Segmentation	OVIS validation	DeVIS (Swin-L)	AP75	38.3	# 19
Video Instance Segmentation	OVIS validation	DeVIS (Swin-L)	AR1	16.6	# 16
Video Instance Segmentation	OVIS validation	DeVIS (Swin-L)	AR10	39.8	# 19
Video Instance Segmentation	OVIS validation	DeVIS (ResNet-50)	mask AP	23.7	# 31
Video Instance Segmentation	OVIS validation	DeVIS (ResNet-50)	AP50	47.6	# 29
Video Instance Segmentation	OVIS validation	DeVIS (ResNet-50)	AP75	20.8	# 31
Video Instance Segmentation	OVIS validation	DeVIS (ResNet-50)	AR1	12.0	# 25
Video Instance Segmentation	OVIS validation	DeVIS (ResNet-50)	AR10	28.9	# 25
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (Swin-L)	mask AP	54.4	# 15
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (Swin-L)	AP50	77.7	# 14
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (Swin-L)	AP75	59.8	# 15
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (Swin-L)	AR10	57.8	# 16
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (Swin-L)	AR1	43.8	# 16
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (ResNet-50)	mask AP	43.1	# 22
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (ResNet-50)	AP50	66.8	# 22
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (ResNet-50)	AP75	46.6	# 22
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (ResNet-50)	AR10	50.1	# 22
Video Instance Segmentation	YouTube-VIS 2021	DeVIS (ResNet-50)	AR1	38.0	# 22
Video Instance Segmentation	YouTube-VIS validation	DeVIS (Swin-L)	mask AP	57.1	# 18
Video Instance Segmentation	YouTube-VIS validation	DeVIS (Swin-L)	AP50	80.8	# 16
Video Instance Segmentation	YouTube-VIS validation	DeVIS (Swin-L)	AP75	66.3	# 15
Video Instance Segmentation	YouTube-VIS validation	DeVIS (Swin-L)	AR1	50.8	# 15
Video Instance Segmentation	YouTube-VIS validation	DeVIS (Swin-L)	AR10	61.0	# 15
Video Instance Segmentation	YouTube-VIS validation	DeVIS (ResNet-50)	mask AP	44.4	# 31
Video Instance Segmentation	YouTube-VIS validation	DeVIS (ResNet-50)	AP50	66.7	# 29
Video Instance Segmentation	YouTube-VIS validation	DeVIS (ResNet-50)	AP75	48.6	# 29
Video Instance Segmentation	YouTube-VIS validation	DeVIS (ResNet-50)	AR1	42.4	# 24
Video Instance Segmentation	YouTube-VIS validation	DeVIS (ResNet-50)	AR10	51.6	# 25

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/devis-making-deformable-transformers-work-for/video-instance-segmentation-on-youtube-vis-2)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-2?p=devis-making-deformable-transformers-work-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/devis-making-deformable-transformers-work-for/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=devis-making-deformable-transformers-work-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/devis-making-deformable-transformers-work-for/video-instance-segmentation-on-ovis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-ovis-1?p=devis-making-deformable-transformers-work-for)`
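
The three badge snippets above share one URL pattern: a shields.io endpoint URL wrapping a Papers with Code badge path, linked to the matching leaderboard page. As a minimal sketch, the helper below rebuilds a snippet from the two slugs that appear in those URLs. The function name `pwc_badge_markdown` and its parameter names are hypothetical, chosen here for illustration; the pattern is inferred only from the snippets listed above and may not cover every benchmark.

```python
def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Build a badge snippet following the pattern of the snippets listed above.

    Both slugs are taken verbatim from the existing badge URLs; this helper
    (a hypothetical name) does not look anything up or validate the slugs.
    """
    # Image URL: shields.io endpoint pointing at the Papers with Code badge path.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as the `p` parameter.
    link_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


if __name__ == "__main__":
    print(
        pwc_badge_markdown(
            "devis-making-deformable-transformers-work-for",
            "video-instance-segmentation-on-ovis-1",
        )
    )
```

Called with the OVIS slugs, this reproduces the third snippet in the table; pasting the result into a README renders the badge.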

DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

22 Jul 2022 · Adrià Caelles, Tim Meinhardt, Guillem Brasó, Laura Leal-Taixé

Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design and therefore missed out on a joint solution. Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods leads to long training times, high memory requirements, and processing restricted to low-resolution, single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method that capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on YouTube-VIS 2021 as well as the challenging OVIS dataset. Code is available at https://github.com/acaelles97/DeVIS.
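
To make the efficiency argument concrete, the following sketch counts query-key interactions for full self-attention versus deformable attention over the feature maps of one clip. Full attention compares every token with every other token, so its cost grows quadratically with the token count, whereas deformable attention lets each query attend to a fixed number of sampled points. The clip length, feature-map size, and number of sampling points below are illustrative assumptions, not the configuration used by DeVIS, and the count ignores attention heads, projection layers, and constant factors.

```python
def full_attention_pairs(num_frames: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = num_frames * height * width
    return tokens * tokens


def deformable_attention_pairs(
    num_frames: int, height: int, width: int, points_per_query: int
) -> int:
    """Query-key pairs when each token attends to a fixed set of sampled points."""
    tokens = num_frames * height * width
    return tokens * points_per_query


if __name__ == "__main__":
    # Assumed values for illustration only (not taken from the paper).
    frames, feat_h, feat_w = 6, 32, 32
    sampled_points = 16

    full = full_attention_pairs(frames, feat_h, feat_w)
    deformable = deformable_attention_pairs(frames, feat_h, feat_w, sampled_points)

    print(f"full attention pairs:       {full:,}")
    print(f"deformable attention pairs: {deformable:,}")
    print(f"ratio (full / deformable):  {full / deformable:.0f}x")
```

Because the deformable count is linear in the number of tokens, doubling the feature-map resolution or adding further scales increases it proportionally, while the full-attention count grows with the square of the token count.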

Code

acaelles97/devis official

Tasks

Instance Segmentation

Object Detection

Segmentation

Semantic Segmentation

Video Instance Segmentation

Datasets

MS COCO

YouTube-VIS 2019

OVIS

YouTube-VIS 2021

Results from the Paper

Ranked #15 on Video Instance Segmentation on YouTube-VIS 2021

Methods

CLIP
