SeqFormer: Sequential Transformer for Video Instance Segmentation

15 Dec 2021  ·  Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of the vision transformer, modeling instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention should be performed on each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. These results significantly exceed the previous state-of-the-art performance by 4.6 and 4.4 AP, respectively. In addition, integrated with the recently proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can serve as a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model. The code is available at https://github.com/wjf5203/SeqFormer.
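
The aggregation idea in the abstract (per-frame instance features combined into one video-level representation, which then predicts a mask on every frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax frame weighting driven by a hypothetical learned vector `weight_vec`, and the single dot-product "mask head" are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate_instance(frame_feats, weight_vec):
    """Combine T per-frame instance features (each of dim C) into one
    video-level embedding via a softmax-weighted sum over frames.
    `weight_vec` is a hypothetical learned vector that scores each frame."""
    scores = [dot(f, weight_vec) for f in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    video_emb = [
        sum(w * f[c] for w, f in zip(weights, frame_feats)) for c in range(dim)
    ]
    return video_emb, weights


def predict_masks(video_emb, pixel_feats_per_frame):
    """Apply the same video-level embedding to every frame's pixel features
    (a dot product followed by a sigmoid), yielding one mask per frame."""
    masks = []
    for pixels in pixel_feats_per_frame:
        masks.append([1.0 / (1.0 + math.exp(-dot(video_emb, p))) for p in pixels])
    return masks


# Toy usage: 3 frames, 4 feature channels, 5 pixels per frame (random values).
random.seed(0)
T, C, P = 3, 4, 5
frame_feats = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(T)]
weight_vec = [random.uniform(-1, 1) for _ in range(C)]
pixel_feats = [
    [[random.uniform(-1, 1) for _ in range(C)] for _ in range(P)] for _ in range(T)
]

video_emb, weights = aggregate_instance(frame_feats, weight_vec)
masks = predict_masks(video_emb, pixel_feats)
```

Because one embedding produces the masks for all frames in this sketch, the per-frame masks belong to the same instance by construction, which is one way to read the abstract's claim that tracking needs no separate branch or post-processing.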

Task                         Dataset                 Model                   Metric Name       Value  Global Rank
Video Instance Segmentation  HQ-YTVIS                SeqFormer (Swin-L)      Tube-Boundary AP  43.3   # 2
Video Instance Segmentation  YouTube-VIS validation  SeqFormer (Swin-L)      mask AP           59.3   # 17
                                                                             AP50              82.1   # 14
                                                                             AP75              66.4   # 14
                                                                             AR1               51.7   # 13
                                                                             AR10              64.4   # 13
Video Instance Segmentation  YouTube-VIS validation  SeqFormer (ResNet-50)   mask AP           45.1   # 30
                                                                             AP50              66.9   # 28
                                                                             AP75              50.5   # 26
                                                                             AR1               45.6   # 21
                                                                             AR10              54.6   # 23
Video Instance Segmentation  YouTube-VIS validation  SeqFormer (ResNet-50)   mask AP           47.4   # 27
                                                                             AP50              69.8   # 24
                                                                             AP75              51.8   # 25
                                                                             AR1               45.5   # 22
                                                                             AR10              54.8   # 22
Video Instance Segmentation  YouTube-VIS validation  SeqFormer (ResNet-101)  mask AP           49.0   # 25
                                                                             AP50              71.1   # 23
                                                                             AP75              55.7   # 21
                                                                             AR1               46.8   # 20
                                                                             AR10              56.9   # 20

Methods