RefineVIS: Video Instance Segmentation with Temporal Attention Refinement

7 Jun 2023  ·  Andre Abrantes, Jiang Wang, Peng Chu, Quanzeng You, Zicheng Liu ·

We introduce RefineVIS, a novel framework for Video Instance Segmentation (VIS) that achieves accurate object association between frames and precise segmentation masks by iteratively refining representations with sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible for linking objects across frames, and a segmentation representation that produces accurate segmentation masks. Contrastive learning is used to learn temporally stable association representations. A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships and a novel temporal contrastive denoising technique. Our method supports both online and offline inference, and achieves state-of-the-art video instance segmentation accuracy on the YouTube-VIS 2019 (64.4 AP), YouTube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets. Visualizations show that the TAR module generates more accurate instance segmentation masks, particularly in challenging cases such as highly occluded objects.
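The abstract describes TAR as refining per-frame segmentation representations by attending over the temporal sequence. As a rough illustration of that idea (not the paper's actual implementation; the module name, dimensions, and residual structure below are assumptions), per-frame instance queries can attend across the frames of a clip and be refined with a residual update:

```python
# Hypothetical sketch of a TAR-style refinement step: each instance's per-frame
# segmentation queries attend over the temporal window, and the attended result
# is added back as a residual. All names and sizes are illustrative.
import torch
import torch.nn as nn


class TemporalAttentionRefinement(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (num_instances, num_frames, embed_dim), one query per
        # instance per frame from a frame-level segmentation model.
        refined, _ = self.attn(queries, queries, queries)  # attend over frames
        return self.norm(queries + refined)  # residual refinement


# Example: refine queries for 10 instances over a 5-frame clip.
tar = TemporalAttentionRefinement()
q = torch.randn(10, 5, 256)
out = tar(q)
print(out.shape)  # torch.Size([10, 5, 256])
```

The refined queries would then be decoded into per-frame masks; since refinement uses the whole window, the same design can run on short windows online or full clips offline.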


Results from the Paper


Ranked #3 on Video Instance Segmentation on YouTube-VIS 2021 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Instance Segmentation | OVIS validation | RefineVIS (Swin-L, offline) | mask AP | 46 | #8 |
| | | | AP50 | 70.4 | #7 |
| | | | AP75 | 48.4 | #7 |
| | | | AR1 | 19.1 | #7 |
| | | | AR10 | 51.2 | #5 |
| Video Instance Segmentation | YouTube-VIS 2021 | RefineVIS (Swin-L, online) | mask AP | 61.4 | #3 |
| | | | AP50 | 84.1 | #2 |
| | | | AP75 | 68.5 | #3 |
| | | | AR10 | 65.2 | #4 |
| | | | AR1 | 48.3 | #5 |
| Video Instance Segmentation | YouTube-VIS validation | RefineVIS (Swin-L, offline) | mask AP | 64.4 | #8 |
| | | | AP50 | 88.3 | #3 |
| | | | AP75 | 72.2 | #5 |
| | | | AR1 | 55.8 | #8 |
| | | | AR10 | 68.4 | #8 |
