SeqFormer: Sequential Transformer for Video Instance Segmentation

15 Dec 2021 · Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of the vision transformer, which models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but the attention mechanism should be applied to each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. These results significantly exceed the previous state of the art by 4.6 and 4.4 AP, respectively. In addition, integrated with the recently proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can serve as a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances the field with a more robust, accurate, and neat model. The code is available at https://github.com/wjf5203/SeqFormer.
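
The idea in the abstract (one instance query shared across the video, attention applied to each frame independently, then temporal aggregation into a video-level representation that yields per-frame masks) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the plain dot-product attention, and the similarity-based frame weights are all assumptions made for illustration, standing in for the learned components of the real model. The repository linked above is the authoritative reference.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, frame_feats):
    """Dot-product attention of one query over the feature vectors of ONE frame."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frame_feats])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats))
            for d in range(len(query))]


def video_instance_masks(query, video_feats):
    """Toy version of: per-frame attention -> temporal aggregation -> per-frame masks.

    video_feats[t][i] is the feature vector of location i in frame t.
    """
    # 1. The same instance query attends to each frame independently.
    per_frame = [attend(query, feats) for feats in video_feats]

    # 2. Temporal aggregation into one video-level embedding.
    #    Assumption: frame weights come from similarity to the query;
    #    a trained model would learn these weights instead.
    frame_weights = softmax([dot(query, e) for e in per_frame])
    instance = [sum(w * e[d] for w, e in zip(frame_weights, per_frame))
                for d in range(len(query))]

    # 3. The video-level embedding scores every location of every frame,
    #    giving one row of mask logits per frame.
    masks = [[dot(instance, f) for f in feats] for feats in video_feats]
    return instance, frame_weights, masks


# Two frames, three locations each, 2-dimensional features (made-up values).
video = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]],
]
instance, frame_weights, masks = video_instance_masks([1.0, 0.0], video)
```

Because every frame's mask logits are produced by the same video-level embedding, the masks belong to one instance by construction, which is the sense in which tracking needs no separate branch or post-processing step.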

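Task: Video Instance Segmentation

Dataset                  Model                    Metric Name        Metric Value   Global Rank
HQ-YTVIS                 SeqFormer (Swin-L)       Tube-Boundary AP   43.3           # 2

YouTube-VIS validation   SeqFormer (Swin-L)       mask AP            59.3           # 5
YouTube-VIS validation   SeqFormer (Swin-L)       AP50               82.1           # 5
YouTube-VIS validation   SeqFormer (Swin-L)       AP75               66.4           # 5
YouTube-VIS validation   SeqFormer (Swin-L)       AR1                51.7           # 4
YouTube-VIS validation   SeqFormer (Swin-L)       AR10               64.4           # 4

YouTube-VIS validation   SeqFormer (ResNet-50)    mask AP            45.1           # 16
YouTube-VIS validation   SeqFormer (ResNet-50)    AP50               66.9           # 16
YouTube-VIS validation   SeqFormer (ResNet-50)    AP75               50.5           # 14
YouTube-VIS validation   SeqFormer (ResNet-50)    AR1                45.6           # 10
YouTube-VIS validation   SeqFormer (ResNet-50)    AR10               54.6           # 12

YouTube-VIS validation   SeqFormer (ResNet-50)    mask AP            47.4           # 13
YouTube-VIS validation   SeqFormer (ResNet-50)    AP50               69.8           # 12
YouTube-VIS validation   SeqFormer (ResNet-50)    AP75               51.8           # 13
YouTube-VIS validation   SeqFormer (ResNet-50)    AR1                45.5           # 11
YouTube-VIS validation   SeqFormer (ResNet-50)    AR10               54.8           # 11

YouTube-VIS validation   SeqFormer (ResNet-101)   mask AP            49.0           # 11
YouTube-VIS validation   SeqFormer (ResNet-101)   AP50               71.1           # 11
YouTube-VIS validation   SeqFormer (ResNet-101)   AP75               55.7           # 9
YouTube-VIS validation   SeqFormer (ResNet-101)   AR1                46.8           # 9
YouTube-VIS validation   SeqFormer (ResNet-101)   AR10               56.9           # 9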

Methods