NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation

29 Aug 2023 · Tim Meinhardt, Matt Feiszli, Yuchen Fan, Laura Leal-Taixe, Rakesh Ranjan

Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods questions this belief, in particular for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS is the first near-online VIS approach that avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on the YouTube-VIS (2019/2021) and OVIS benchmarks.
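The three processing paradigms contrasted in the abstract differ mainly in how many frames the model sees at once. The following minimal sketch only illustrates that partitioning; it is not the NOVIS implementation. The function name `clip_windows` and the parameters `clip_len` and `stride` are hypothetical, and the clip length and overlap NOVIS actually uses are not stated in this abstract.

```python
from typing import List


def clip_windows(num_frames: int, clip_len: int, stride: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into clips of at most clip_len frames.

    Hypothetical helper for illustration only:
      clip_len == 1           -> online (frame-by-frame) processing
      clip_len == num_frames  -> offline (whole-video) processing
      anything in between     -> near-online (clip-by-clip) processing
    A stride smaller than clip_len makes consecutive clips share frames.
    """
    if num_frames < 0 or clip_len < 1 or not 1 <= stride <= clip_len:
        raise ValueError("need num_frames >= 0 and 1 <= stride <= clip_len")

    clips: List[List[int]] = []
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)  # the last clip may be shorter
        clips.append(list(range(start, end)))
        if end == num_frames:
            break  # every frame is covered; stop before emitting a redundant clip
        start += stride
    return clips


if __name__ == "__main__":
    # Near-online: clips of 3 frames, consecutive clips share 1 frame.
    print(clip_windows(num_frames=7, clip_len=3, stride=2))
```

In this sketch, the frames shared by consecutive clips are where a near-online method could link instance identities from one clip to the next. How NOVIS does this with its overlap embeddings is described in the paper, not here.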


Results from the Paper


Task: Video Instance Segmentation. Each cell shows the metric value with its global rank in parentheses.

Dataset                 Model              mask AP     AP50        AP75        AR1         AR10
OVIS validation         NOVIS (ResNet-50)  32.7 (#26)  56.2 (#22)  32.6 (#25)  15.7 (#20)  37.1 (#21)
OVIS validation         NOVIS (Swin-L)     43.5 (#11)  68.3 (#12)  43.8 (#14)  19.4 (#3)   46.9 (#12)
YouTube-VIS 2021        NOVIS (ResNet-50)  47.2 (#21)  69.4 (#20)  50.0 (#21)  41.3 (#20)  54.4 (#21)
YouTube-VIS 2021        NOVIS (Swin-L)     59.8 (#8)   82.0 (#5)   66.5 (#7)   47.9 (#6)   64.4 (#8)
YouTube-VIS validation  NOVIS (ResNet-50)  52.8 (#22)  75.7 (#20)  56.9 (#20)  50.3 (#16)  60.6 (#16)
YouTube-VIS validation  NOVIS (Swin-L)     65.7 (#5)   87.8 (#5)   72.2 (#5)   56.3 (#4)   70.3 (#3)

Methods