DVIS: Decoupled Video Instance Segmentation Framework

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. First, offline methods are limited by the tightly coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames, introducing excessive noise during long-term temporal alignment. Second, online methods suffer from inadequate utilization of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining accurate long-term alignment results via frame-by-frame association during tracking, and 2) effectively utilizing temporal information based on those alignment results during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance on both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are among the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter's FLOPs), enabling efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
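
To make the decoupled pipeline concrete, below is a minimal, hypothetical sketch of the three stages described in the abstract (segmentation, tracking, refinement). The module names, attention layout, and tensor shapes are illustrative assumptions and not the authors' implementation; in the sketch, per-frame instance queries stand in for the output of a frozen image segmenter, a "referring tracker" aligns them frame by frame, and a "temporal refiner" then attends over the whole video on top of the aligned queries.

```python
# Hypothetical sketch of the decoupled VIS pipeline: segmentation -> tracking -> refinement.
# All internals below are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class ReferringTracker(nn.Module):
    """Associates each frame's instance queries with the previous frame's queries
    (frame-by-frame association), producing identity-consistent queries over time."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Cross-attention: previous-frame (aligned) queries "refer to" current-frame queries.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_queries):  # frame_queries: (T, N, C) per-frame instance queries
        aligned = []
        prev = frame_queries[0].unsqueeze(0)  # (1, N, C) initial identities
        for t in range(frame_queries.shape[0]):
            cur = frame_queries[t].unsqueeze(0)  # (1, N, C)
            # Align current-frame instances to the identities carried by `prev`.
            out, _ = self.cross_attn(query=prev, key=cur, value=cur)
            aligned.append(out.squeeze(0))
            prev = out  # propagate aligned queries to the next frame
        return torch.stack(aligned)  # (T, N, C), temporally aligned


class TemporalRefiner(nn.Module):
    """Exploits whole-video context on top of the already-aligned instance queries."""

    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, aligned_queries):  # (T, N, C)
        # Attend along the time axis independently for each instance slot.
        x = aligned_queries.permute(1, 0, 2)   # (N, T, C)
        x = self.temporal_encoder(x)           # temporal self-attention
        return x.permute(1, 0, 2)              # back to (T, N, C)


# Toy usage: 5 frames, 20 instance queries, 256-dim features.
frame_queries = torch.randn(5, 20, 256)  # would come from a frozen per-frame segmenter
tracker, refiner = ReferringTracker(), TemporalRefiner()
refined = refiner(tracker(frame_queries))
print(refined.shape)  # torch.Size([5, 20, 256])
```

Because tracking and refinement operate only on compact instance queries rather than dense features, modules of this kind can stay small relative to the image segmenter, which is consistent with the lightweight tracker and refiner reported above.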

Benchmark results (Task, Dataset, Model; metric value followed by global rank)

Video Instance Segmentation, OVIS validation, DVIS (Swin-L, Offline):
  mask AP 49.9 (#3), AP50 75.9 (#2), AP75 53.0 (#4), AR1 19.4 (#3), AR10 55.3 (#2)

Video Instance Segmentation, OVIS validation, DVIS (Swin-L, Online):
  mask AP 47.1 (#6), AP50 71.9 (#5), AP75 49.2 (#6), AR1 19.4 (#3), AR10 52.5 (#4)

Video Panoptic Segmentation, VIPSeg, DVIS (Swin-L):
  VPQ 57.6 (#3), STQ 55.3 (#3)

Video Instance Segmentation, YouTube-VIS 2021, DVIS (Swin-L):
  mask AP 60.1 (#6), AP50 83.0 (#3), AP75 68.4 (#4), AR1 47.7 (#7), AR10 65.7 (#3)

Video Instance Segmentation, YouTube-VIS 2022 validation, DVIS (Swin-L):
  mAP_L 45.9 (#3), AP50_L 69.0 (#2), AP75_L 48.8 (#2), AR1_L 37.2 (#2), AR10_L 51.8 (#2)

Video Instance Segmentation, YouTube-VIS validation, DVIS (Swin-L):
  mask AP 64.9 (#6), AP50 88.0 (#4), AP75 72.7 (#4), AR1 56.5 (#3), AR10 70.3 (#3)
