TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Instance Segmentation	OVIS validation	InstanceFormer (Swin-L)	mask AP	22.8	# 32
Video Instance Segmentation	OVIS validation	InstanceFormer (Swin-L)	AP50	42.5	# 30
Video Instance Segmentation	OVIS validation	InstanceFormer (Swin-L)	AP75	21.61	# 30
Video Instance Segmentation	OVIS validation	InstanceFormer (Swin-L)	AR1	12.9	# 24
Video Instance Segmentation	OVIS validation	InstanceFormer (Swin-L)	AR10	29.3	# 24
Video Instance Segmentation	OVIS validation	InstanceFormer (ResNet-50)	mask AP	20.0	# 33
Video Instance Segmentation	OVIS validation	InstanceFormer (ResNet-50)	AP50	40.7	# 31
Video Instance Segmentation	OVIS validation	InstanceFormer (ResNet-50)	AP75	18.1	# 32
Video Instance Segmentation	OVIS validation	InstanceFormer (ResNet-50)	AR1	12.0	# 25
Video Instance Segmentation	OVIS validation	InstanceFormer (ResNet-50)	AR10	27.1	# 26
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (ResNet-50)	mask AP	40.8	# 23
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (ResNet-50)	AP50	62.4	# 23
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (ResNet-50)	AP75	43.7	# 23
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (ResNet-50)	AR10	48.1	# 23
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (ResNet-50)	AR1	36.1	# 23
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (Swin-L)	mask AP	51.0	# 17
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (Swin-L)	AP50	73.7	# 17
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (Swin-L)	AP75	56.9	# 17
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (Swin-L)	AR10	56.0	# 18
Video Instance Segmentation	YouTube-VIS 2021	InstanceFormer (Swin-L)	AR1	42.8	# 17
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (ResNet-50)	mAP_L	24.8	# 6
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (ResNet-50)	AP50_L	49.5	# 3
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (ResNet-50)	AP75_L	26.7	# 4
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (ResNet-50)	AR1_L	23.9	# 4
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (ResNet-50)	AR10_L	30.1	# 3
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (Swin)	mAP_L	26.3	# 5
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (Swin)	AP50_L	44.6	# 4
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (Swin)	AP75_L	27.3	# 3
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (Swin)	AR1_L	25.0	# 3
Video Instance Segmentation	Youtube-VIS 2022 Validation	InstanceFormer (Swin)	AR10_L	29.2	# 4
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (ResNet-50)	mask AP	45.6	# 29
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (ResNet-50)	AP50	68.6	# 26
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (ResNet-50)	AP75	49.6	# 28
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (ResNet-50)	AR1	42.1	# 25
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (ResNet-50)	AR10	53.5	# 24
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (Swin-L)	mask AP	56.3	# 19
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (Swin-L)	AP50	78.0	# 18
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (Swin-L)	AP75	64.2	# 18
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (Swin-L)	AR1	50.9	# 14
Video Instance Segmentation	YouTube-VIS validation	InstanceFormer (Swin-L)	AR10	61.6	# 14
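
To compare the two backbones, the table can be loaded as tab-separated text. The following is a minimal sketch, not part of the original page: it embeds only the mask AP rows, and the helper names `backbone`, `load_scores`, and `backbone_gain` are hypothetical. The backbone is taken from the parenthesised suffix of the model name and lowercased, so spacing or casing differences in the model name do not matter.

```python
import csv
import io
import re

# The mask AP rows of the table, tab-separated, with the same header row.
TABLE = (
    "TASK\tDATASET\tMODEL\tMETRIC NAME\tMETRIC VALUE\tGLOBAL RANK\n"
    "Video Instance Segmentation\tOVIS validation\tInstanceFormer (Swin-L)\tmask AP\t22.8\t# 32\n"
    "Video Instance Segmentation\tOVIS validation\tInstanceFormer(ResNet-50)\tmask AP\t20.0\t# 33\n"
    "Video Instance Segmentation\tYouTube-VIS 2021\tInstanceFormer (ResNet-50)\tmask AP\t40.8\t# 23\n"
    "Video Instance Segmentation\tYouTube-VIS 2021\tInstanceFormer (Swin-L)\tmask AP\t51.0\t# 17\n"
    "Video Instance Segmentation\tYouTube-VIS validation\tInstanceFormer(ResNet-50)\tmask AP\t45.6\t# 29\n"
    "Video Instance Segmentation\tYouTube-VIS validation\tInstanceFormer(Swin-L)\tmask AP\t56.3\t# 19\n"
)


def backbone(model):
    """Return the lowercased backbone named in parentheses, e.g. 'swin-l'."""
    match = re.search(r"\(([^)]+)\)", model)
    return match.group(1).strip().lower() if match else ""


def load_scores(text):
    """Map (dataset, backbone, metric) to a (value, global rank) pair."""
    scores = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        key = (row["DATASET"], backbone(row["MODEL"]), row["METRIC NAME"])
        rank = int(row["GLOBAL RANK"].lstrip("# "))
        scores[key] = (float(row["METRIC VALUE"]), rank)
    return scores


def backbone_gain(scores, dataset, metric="mask AP"):
    """Swin-L score minus ResNet-50 score on one dataset, rounded to 0.1."""
    swin, _ = scores[(dataset, "swin-l", metric)]
    resnet, _ = scores[(dataset, "resnet-50", metric)]
    return round(swin - resnet, 1)


scores = load_scores(TABLE)
for dataset in ("OVIS validation", "YouTube-VIS 2021", "YouTube-VIS validation"):
    print(dataset, backbone_gain(scores, dataset))
```

With the values listed in the table, the Swin-L backbone is ahead of ResNet-50 by 2.8 mask AP on OVIS validation, 10.2 on YouTube-VIS 2021, and 10.7 on YouTube-VIS validation.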

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/instanceformer-an-online-video-instance/video-instance-segmentation-on-youtube-vis-3)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-3?p=instanceformer-an-online-video-instance)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/instanceformer-an-online-video-instance/video-instance-segmentation-on-youtube-vis-2)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-2?p=instanceformer-an-online-video-instance)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/instanceformer-an-online-video-instance/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=instanceformer-an-online-video-instance)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/instanceformer-an-online-video-instance/video-instance-segmentation-on-ovis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-ovis-1?p=instanceformer-an-online-video-instance)`
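
The four badge snippets share one pattern: a paper slug combined with a benchmark slug. As a minimal sketch, the markdown can be generated from those two slugs. The function name `pwc_badge` is hypothetical, and the URL layout is inferred only from the snippets listed here, not from any documented API.

```python
PAPER_SLUG = "instanceformer-an-online-video-instance"


def pwc_badge(benchmark_slug, paper_slug=PAPER_SLUG):
    """Build the badge markdown in the same layout as the snippets above."""
    badge = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    link = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge})]({link})"


# Regenerates the OVIS badge markdown listed above.
print(pwc_badge("video-instance-segmentation-on-ovis-1"))
```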

InstanceFormer: An Online Video Instance Segmentation Framework

22 Aug 2022 · Rajat Koner, Tanveer Hannan, Suprosanna Shit, Sahand Sharifzadeh, Matthias Schubert, Thomas Seidl, Volker Tresp

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose an efficient single-stage transformer-based online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependencies and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
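
Two of the ideas in the abstract, attention over a bounded memory of earlier instance embeddings and a contrastive loss that keeps one instance's embeddings close across frames, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain scaled dot-product attention and an InfoNCE-style loss, and the names `memory_attention`, `temporal_contrastive_loss`, `WINDOW`, and `temperature` are all hypothetical. The official code is in the repository linked above.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def memory_attention(query, memory):
    """Softmax-weighted average of memory entries, scored by scaled dot product."""
    scale = math.sqrt(len(query))
    scores = [dot(query, m) / scale for m in memory]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * m[i] for w, m in zip(weights, memory)) / total
        for i in range(len(query))
    ]


def temporal_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption): pull the same instance together
    across frames, push embeddings of other instances away."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Assumed window size: only the last WINDOW frame embeddings are kept.
WINDOW = 3
memory = deque(maxlen=WINDOW)
for embedding in ([1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3]):
    memory.append(embedding)  # the oldest entry drops out once the window is full

context = memory_attention([1.0, 0.0], list(memory))

# The loss is small when the positive is the same instance in another frame,
# and large when the positive and the negative are swapped.
consistent = temporal_contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
swapped = temporal_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
print(context, consistent, swapped)
```

The `deque` with `maxlen` stands in for the "certain temporal window": older frames are discarded automatically, so the cost of attention stays bounded regardless of video length.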

Code

rajatkoner08/instanceformer official

Tasks

Instance Segmentation

Semantic Segmentation

Video Instance Segmentation

Datasets

YouTube-VIS 2019

OVIS

YouTube-VIS 2021

Youtube-VIS 2022 Validation

Results from the Paper

Ranked #5 on Video Instance Segmentation on Youtube-VIS 2022 Validation (using extra training data)
