In Defense of Online Models for Video Instance Segmentation

21 Jul 2022 · Junfeng Wu, Qihao Liu, Yi Jiang, Song Bai, Alan Yuille, Xiang Bai

In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models have gradually attracted less attention, possibly because of their inferior performance. However, online methods have an inherent advantage in handling long video sequences and ongoing videos, where offline models fail because of limited computational resources. Therefore, it would be highly desirable if online models could achieve comparable or even better performance than offline models. By dissecting current online and offline models, we demonstrate that the main cause of the performance gap is the error-prone association between frames, caused by the similar appearance of different instances in the feature space. Observing this, we propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association and fully exploit history information for stability. Despite its simplicity, our method outperforms all online and offline methods on three benchmarks. Specifically, we achieve 49.5 AP on YouTube-VIS 2019, a significant improvement of 13.2 AP and 2.1 AP over the prior online and offline art, respectively. Moreover, we achieve 30.2 AP on OVIS, a more challenging dataset with significant crowding and occlusions, surpassing the prior art by 14.8 AP. The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR 2022). We hope the simplicity and effectiveness of our method, as well as our insight into current methods, can shed light on the exploration of VIS models.
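The two ideas named in the abstract, contrastive learning of instance embeddings and the use of history for stable association, can be illustrated with a small sketch. This is not the paper's implementation: the InfoNCE-style loss, the temperature, the greedy matching rule, the similarity threshold, and the moving-average memory update are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed form, not the paper's exact loss).

    The loss is small when `query` is more similar to `positive`
    (the same instance in another frame) than to any of `negatives`.
    """
    q = normalize(query)
    pos = dot(q, normalize(positive)) / temperature
    logits = [pos] + [dot(q, normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - pos


def associate(detections, memory, threshold=0.5):
    """Greedily match each detection embedding to the most similar tracked instance.

    `memory` maps a track id to its stored embedding. Returns one track id per
    detection, or None when no unused track exceeds the (assumed) threshold.
    """
    assigned, used = [], set()
    for det in detections:
        d = normalize(det)
        best_id, best_sim = None, threshold
        for track_id, emb in memory.items():
            if track_id in used:
                continue
            sim = dot(d, normalize(emb))
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        if best_id is not None:
            used.add(best_id)
        assigned.append(best_id)
    return assigned


def update_memory(memory, track_id, embedding, momentum=0.8):
    """Blend the new embedding into the stored one so history smooths the match."""
    old = memory.get(track_id)
    if old is None:
        memory[track_id] = list(embedding)
    else:
        memory[track_id] = [
            momentum * o + (1 - momentum) * e for o, e in zip(old, embedding)
        ]
```

In this sketch, embeddings trained with a loss like `info_nce` would be pushed apart for different instances, which is what makes the similarity test in `associate` less error-prone; `update_memory` stands in for the use of history information, so a single noisy frame does not overwrite an instance's stored appearance.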

Results (task: Video Instance Segmentation). Each cell gives the metric value, with the global rank in parentheses.

| Dataset | Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|
| OVIS validation | IDOL (ResNet-50) | 30.2 (#4) | 51.3 (#5) | 30 (#4) | 15 (#4) | 37.5 (#4) |
| OVIS validation | IDOL (Swin-L) | 42.6 (#1) | 65.7 (#1) | 45.2 (#1) | 17.9 (#2) | 49.6 (#1) |
| YouTube-VIS 2021 validation | IDOL (Swin-L) | 56.1 (#2) | 80.8 (#1) | 63.5 (#1) | 45 (#3) | 60.1 (#3) |
| YouTube-VIS validation | IDOL (Swin-L) | 64.3 (#1) | 87.5 (#1) | 71.0 (#1) | 55.6 (#2) | 69.1 (#1) |
| YouTube-VIS validation | IDOL (ResNet-50) | 49.5 (#9) | 74 (#9) | 52.9 (#12) | 47.7 (#7) | 58.7 (#7) |

Methods