VITA: Video Instance Segmentation via Object Token Association

9 Jun 2022 · Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using this condensed information, VITA achieves state-of-the-art results on VIS benchmarks with a ResNet-50 backbone: 49.8 AP and 45.7 AP on YouTube-VIS 2019 and 2021, and 19.6 AP on OVIS. Moreover, thanks to its object token-based structure that is decoupled from the backbone features, VITA offers practical advantages that previous offline VIS methods have not explored: handling long and high-resolution videos on a common GPU, and freezing a frame-level detector trained on the image domain. Code is available at https://github.com/sukjunhwang/VITA.
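
The token-association idea described above can be illustrated with a minimal sketch. The module and tensor names below are hypothetical and this is not the authors' implementation: per-frame object tokens produced by a (possibly frozen) image instance segmentation model are flattened across the clip and associated by a small Transformer, and learnable video-level queries decode clip-wide instance embeddings, without ever touching spatio-temporal backbone features.

```python
# Minimal PyTorch sketch of object-token association (hypothetical names;
# not VITA's actual code). Frame-level object tokens are mixed across time
# by a Transformer encoder, then video-level queries decode clip-wide
# instance embeddings.
import torch
import torch.nn as nn


class TokenAssociator(nn.Module):
    def __init__(self, dim=256, num_video_queries=100, num_layers=3, num_heads=8):
        super().__init__()
        # Encoder associates frame-level object tokens across the whole clip.
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        # Learnable video-level queries, one per potential video instance.
        self.video_queries = nn.Parameter(torch.randn(num_video_queries, dim))
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, C) object tokens from the image detector,
        # for B clips, T frames, N object queries per frame, C channels.
        b, t, n, c = frame_tokens.shape
        tokens = frame_tokens.reshape(b, t * n, c)      # flatten time and objects
        tokens = self.encoder(tokens)                   # cross-frame association
        queries = self.video_queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.decoder(queries, tokens)    # (B, Q, C) video-level instances
        return video_tokens


# Usage: 2 clips, 36 frames, 100 object tokens per frame, 256-dim features.
if __name__ == "__main__":
    assoc = TokenAssociator()
    out = assoc(torch.randn(2, 36, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```

Because the associator consumes only the compact object tokens, memory grows with the number of tokens rather than with frame resolution, which is what makes long, high-resolution clips tractable on a single GPU.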


Results from the Paper


Ranked #11 on Video Instance Segmentation on YouTube-VIS 2021 (using extra training data)

Task: Video Instance Segmentation    Model: VITA (Swin-L)

Dataset                  Metric     Value    Global Rank
OVIS validation          mask AP    27.7     #30
                         AP50       51.9     #26
                         AP75       24.9     #29
                         AR1        14.9     #23
                         AR10       33.0     #23
YouTube-VIS 2021         mask AP    57.5     #11
                         AP50       80.6     #11
                         AP75       61.0     #14
                         AR10       62.6     #11
                         AR1        47.7     #7
YouTube-VIS validation   mask AP    63.0     #12
                         AP50       86.9     #8
                         AP75       67.9     #11
                         AR1        56.3     #4
                         AR10       68.1     #9

Methods


No methods listed for this paper.