TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Instance Segmentation	HQ-YTVIS	VMT (Swin-L)	Tube-Boundary AP	44.8	# 1
Video Instance Segmentation	HQ-YTVIS	VMT (R101)	Tube-Boundary AP	32.5	# 3
Video Instance Segmentation	HQ-YTVIS	VMT (R50)	Tube-Boundary AP	30.7	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-mask-transfiner-for-high-quality-video/video-instance-segmentation-on-hq-ytvis)](https://paperswithcode.com/sota/video-instance-segmentation-on-hq-ytvis?p=video-mask-transfiner-for-high-quality-video)`

Video Mask Transfiner for High-Quality Video Instance Segmentation

28 Jul 2022 · Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the effectiveness of our method in segmenting complex and dynamic objects, by capturing precise details.
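
The abstract does not spell out how "sparse error-prone spatio-temporal regions" are found. Purely as an illustration, the sketch below flags pixels of a binary mask tube that a coarser version of the same mask would get wrong: each frame is reduced block-wise by majority vote, and every pixel that disagrees with its block's vote is reported. This criterion, the function name `error_prone_regions`, and the `factor` parameter are assumptions made for the example; they are not taken from the paper or from the SysCV/vmt code.

```python
def error_prone_regions(mask_tube, factor=2):
    """Return a sparse list of (t, y, x) pixels lost by coarsening.

    mask_tube: list of frames, each a list of rows of 0/1 values.
    factor: side length of the square blocks used for coarsening
            (hypothetical parameter, chosen for illustration).
    """
    flagged = []
    for t, frame in enumerate(mask_tube):
        h, w = len(frame), len(frame[0])
        if h % factor or w % factor:
            raise ValueError("frame height and width must be divisible by factor")
        for by in range(0, h, factor):
            for bx in range(0, w, factor):
                block = [(y, x)
                         for y in range(by, by + factor)
                         for x in range(bx, bx + factor)]
                ones = sum(frame[y][x] for y, x in block)
                # Majority vote for the block; ties count as foreground.
                coarse = 1 if 2 * ones >= factor * factor else 0
                # Pixels that disagree with the vote are the ones a
                # coarse mask cannot represent, typically near boundaries.
                flagged.extend((t, y, x) for y, x in block
                               if frame[y][x] != coarse)
    return flagged


# Two 4x4 frames: a block-aligned square, then a boundary cutting a block.
tube = [
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]],
]
print(error_prone_regions(tube))
# [(1, 0, 3), (1, 1, 3), (1, 2, 3), (1, 3, 3)]
```

Only the second frame yields flagged pixels, and only along the column where the object boundary cuts through a block. That sparsity is what makes it plausible to refine a small set of points at high resolution instead of the full mask.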

Code

SysCV/vmt

Tasks

Instance Segmentation

Semantic Segmentation

Video Instance Segmentation

Vocal Bursts Intensity Prediction

Datasets

Introduced in the Paper:

HQ-YTVIS

Used in the Paper:

BDD100K

YouTube-VIS 2019

Results from the Paper

Ranked #1 on Video Instance Segmentation on HQ-YTVIS

