InstanceFormer: An Online Video Instance Segmentation Framework

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage, transformer-based, efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependencies and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
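
The abstract does not spell out the temporal contrastive loss, so the following is only a minimal sketch of the general idea, assuming a generic InfoNCE-style formulation rather than the paper's exact one. The function names, the cosine similarity, and the temperature value are all illustrative assumptions. The embedding of an instance in the current frame is pulled toward the embedding of the same instance in another frame and pushed away from the embeddings of other instances.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    anchor:    embedding of an instance in the current frame
    positive:  embedding of the same instance in another frame
    negatives: embeddings of other instances
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit


# Hypothetical 2-D embeddings, chosen only for illustration.
anchor = [1.0, 0.0]
other_instances = [[0.0, 1.0], [-1.0, 0.0]]
coherent = temporal_contrastive_loss(anchor, [0.9, 0.1], other_instances)
incoherent = temporal_contrastive_loss(anchor, [0.0, 1.0], other_instances)
```

In this sketch the loss is small when an instance keeps a similar embedding across frames (`coherent`) and larger when its embedding drifts toward another instance (`incoherent`), which is the coherence property the abstract attributes to the loss.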


Results from the Paper


Task: Video Instance Segmentation. Each cell gives the metric value followed by the global rank in parentheses.

| Dataset | Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|
| OVIS validation | InstanceFormer (Swin-L) | 22.8 (#32) | 42.5 (#30) | 21.61 (#30) | 12.9 (#24) | 29.3 (#24) |
| OVIS validation | InstanceFormer (ResNet-50) | 20.0 (#33) | 40.7 (#31) | 18.1 (#32) | 12 (#25) | 27.1 (#26) |
| YouTube-VIS 2021 | InstanceFormer (ResNet-50) | 40.8 (#23) | 62.4 (#23) | 43.7 (#23) | 36.1 (#23) | 48.1 (#23) |
| YouTube-VIS 2021 | InstanceFormer (Swin-L) | 51.0 (#17) | 73.7 (#17) | 56.9 (#17) | 42.8 (#17) | 56.0 (#18) |
| YouTube-VIS validation | InstanceFormer (ResNet-50) | 45.6 (#29) | 68.6 (#26) | 49.6 (#28) | 42.1 (#25) | 53.5 (#24) |
| YouTube-VIS validation | InstanceFormer (Swin-L) | 56.3 (#19) | 78.0 (#18) | 64.2 (#18) | 50.9 (#14) | 61.6 (#14) |

| Dataset | Model | mAP_L | AP50_L | AP75_L | AR1_L | AR10_L |
|---|---|---|---|---|---|---|
| YouTube-VIS 2022 Validation | InstanceFormer (ResNet-50) | 24.8 (#6) | 49.5 (#3) | 26.7 (#4) | 23.9 (#4) | 30.1 (#3) |
| YouTube-VIS 2022 Validation | InstanceFormer (Swin) | 26.3 (#5) | 44.6 (#4) | 27.3 (#3) | 25.0 (#3) | 29.2 (#4) |
