TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Instance Segmentation	HQ-YTVIS	SeqFormer (Swin-L)	Tube-Boundary AP	43.3	# 2
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (Swin-L)	mask AP	59.3	# 17
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (Swin-L)	AP50	82.1	# 14
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (Swin-L)	AP75	66.4	# 14
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (Swin-L)	AR1	51.7	# 13
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (Swin-L)	AR10	64.4	# 13
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	mask AP	45.1	# 30
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AP50	66.9	# 28
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AP75	50.5	# 26
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AR1	45.6	# 21
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AR10	54.6	# 23
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	mask AP	47.4	# 27
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AP50	69.8	# 24
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AP75	51.8	# 25
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AR1	45.5	# 22
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-50)	AR10	54.8	# 22
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-101)	mask AP	49.0	# 25
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-101)	AP50	71.1	# 23
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-101)	AP75	55.7	# 21
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-101)	AR1	46.8	# 20
Video Instance Segmentation	YouTube-VIS validation	SeqFormer (ResNet-101)	AR10	56.9	# 20

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/seqformer-a-frustratingly-simple-model-for/video-instance-segmentation-on-hq-ytvis)](https://paperswithcode.com/sota/video-instance-segmentation-on-hq-ytvis?p=seqformer-a-frustratingly-simple-model-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/seqformer-a-frustratingly-simple-model-for/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=seqformer-a-frustratingly-simple-model-for)`

SeqFormer: Sequential Transformer for Video Instance Segmentation

15 Dec 2021 · Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the vision transformer principle of modeling instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention should be performed on each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. These results exceed the previous state of the art by 4.6 and 4.4 AP, respectively. In addition, integrated with the recently proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can serve as a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model. The code is available at https://github.com/wjf5203/SeqFormer.
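
The core idea in the abstract (one shared instance query, attention applied to each frame independently, then temporal aggregation into a video-level representation) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the plain scaled dot-product attention, the softmax-weighted sum over frames, and the `frame_scorer` vector are all assumptions chosen for illustration, and the features are random numbers rather than backbone outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, frame_features):
    """Scaled dot-product attention of one query over one frame's features."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, feat) / scale for feat in frame_features])
    dim = len(query)
    return [sum(w * feat[d] for w, feat in zip(weights, frame_features))
            for d in range(dim)]


def video_level_embedding(instance_query, video_features, frame_scorer):
    """Attend to each frame independently, then aggregate over time."""
    # The same instance query is reused for every frame; frames never mix here.
    per_frame = [attend(instance_query, frame) for frame in video_features]
    # Assumed aggregation: softmax weights from a linear score of each frame output.
    frame_weights = softmax([dot(frame_scorer, out) for out in per_frame])
    dim = len(instance_query)
    embedding = [sum(w * out[d] for w, out in zip(frame_weights, per_frame))
                 for d in range(dim)]
    return per_frame, frame_weights, embedding


random.seed(0)
DIM, NUM_FRAMES, NUM_LOCATIONS = 4, 3, 5
# Hypothetical video: NUM_FRAMES frames, each with NUM_LOCATIONS feature vectors.
video = [[[random.gauss(0, 1) for _ in range(DIM)]
          for _ in range(NUM_LOCATIONS)]
         for _ in range(NUM_FRAMES)]
query = [random.gauss(0, 1) for _ in range(DIM)]
scorer = [random.gauss(0, 1) for _ in range(DIM)]

per_frame, frame_weights, embedding = video_level_embedding(query, video, scorer)
print(len(per_frame), len(embedding))
```

In this sketch the per-frame outputs play the role of frame-level instance localization, and the single aggregated vector stands in for the video-level instance representation that, per the abstract, drives mask prediction on every frame. Because one query produces outputs for all frames, the frame-to-frame association comes for free, which mirrors the abstract's point that no separate tracking branch is needed.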

Code

wjf5203/SeqFormer (official), 338 stars

wjf5203/vnext, 592 stars

Tasks

Instance Segmentation

Semantic Segmentation

Video Instance Segmentation

Datasets

MS COCO

YouTube-VIS 2019

YouTube-VIS 2021

HQ-YTVIS

Results from the Paper

Ranked #2 on Video Instance Segmentation on HQ-YTVIS

Methods

Dense Connections • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer
