VITA: Video Instance Segmentation via Object Token Association

9 Jun 2022 · Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using this condensed information, VITA achieves state-of-the-art results on VIS benchmarks with a ResNet-50 backbone: 49.8 AP and 45.7 AP on YouTube-VIS 2019 and 2021, and 19.6 AP on OVIS. Moreover, thanks to its object token-based structure that is decoupled from the backbone features, VITA offers practical advantages that previous offline VIS methods have not explored: handling long and high-resolution videos on a common GPU, and freezing a frame-level detector trained on the image domain. Code is available at https://github.com/sukjunhwang/VITA.
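
The token-association idea described above can be illustrated with a minimal sketch. The module and tensor names below are hypothetical and this is not the authors' implementation: per-frame object tokens produced by a (possibly frozen) image instance segmentation model are flattened across the clip and associated by a small Transformer, and learnable video-level queries decode clip-wide instance embeddings, without ever touching spatio-temporal backbone features.

```python
# Minimal PyTorch sketch of object-token association (hypothetical names;
# not VITA's actual code). Frame-level object tokens are mixed across time
# by a Transformer encoder, then video-level queries decode clip-wide
# instance embeddings.
import torch
import torch.nn as nn


class TokenAssociator(nn.Module):
    def __init__(self, dim=256, num_video_queries=100, num_layers=3, num_heads=8):
        super().__init__()
        # Encoder associates frame-level object tokens across the whole clip.
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        # Learnable video-level queries, one per potential video instance.
        self.video_queries = nn.Parameter(torch.randn(num_video_queries, dim))
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, C) object tokens from the image detector,
        # for B clips, T frames, N object queries per frame, C channels.
        b, t, n, c = frame_tokens.shape
        tokens = frame_tokens.reshape(b, t * n, c)      # flatten time and objects
        tokens = self.encoder(tokens)                   # cross-frame association
        queries = self.video_queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.decoder(queries, tokens)    # (B, Q, C) video-level instances
        return video_tokens


# Usage: 2 clips, 36 frames, 100 object tokens per frame, 256-dim features.
if __name__ == "__main__":
    assoc = TokenAssociator()
    out = assoc(torch.randn(2, 36, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```

Because the associator consumes only the compact object tokens, memory grows with the number of tokens rather than with frame resolution, which is what makes long, high-resolution clips tractable on a single GPU.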


Results from the Paper


Ranked #11 on Video Instance Segmentation on YouTube-VIS 2021 (using extra training data)

Task: Video Instance Segmentation    Model: VITA (Swin-L)

Dataset                  Metric     Value    Global Rank
OVIS validation          mask AP    27.7     #30
                         AP50       51.9     #26
                         AP75       24.9     #29
                         AR1        14.9     #23
                         AR10       33.0     #23
YouTube-VIS 2021         mask AP    57.5     #11
                         AP50       80.6     #11
                         AP75       61.0     #14
                         AR10       62.6     #11
                         AR1        47.7     #7
YouTube-VIS validation   mask AP    63.0     #12
                         AP50       86.9     #8
                         AP75       67.9     #11
                         AR1        56.3     #4
                         AR10       68.1     #9

Methods


No methods listed for this paper.