A Graph Matching Perspective With Transformers on Video Instance Segmentation
Video Instance Segmentation (VIS) requires automatically tracking and segmenting multiple objects in videos, a task that relies on modeling the spatial-temporal interactions of the instances. This paper formulates VIS as a graph matching problem. Unlike the traditional tracking-by-detection paradigm or bottom-up generative solutions, we introduce a novel, learnable graph matching Transformer that predicts instances by learning their spatial-temporal relationships. Specifically, we take advantage of the powerful Transformer and exploit temporal feature aggregation to implicitly capture long-term temporal information across frames. Our model generates instance proposals per frame and associates them with instances from historical frames via the proposed graph matching over the aggregated features. Furthermore, to make the whole network optimization end-to-end differentiable, we relax the original graph matching into a continuous quadratic program and unroll its optimization into a deep graph network. Extensive experimental results on two representative benchmarks, YouTube-VIS19 and OVIS, verify the effectiveness of our graph matching Transformer.
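To make the relaxation concrete, the sketch below shows one common way to unroll a continuous graph matching relaxation so that gradients flow back into the affinities. This is a minimal illustration under assumptions, not the paper's released implementation: the function names (`sinkhorn_normalize`, `unrolled_graph_matching`) and the power-iteration-plus-Sinkhorn update are hypothetical stand-ins for the exact quadratic-programming solver described in the paper.

```python
import torch

def sinkhorn_normalize(x, n_iters=10, eps=1e-8):
    # Alternate row/column normalization so the soft assignment
    # approaches a doubly-stochastic matrix, i.e. the continuous
    # relaxation of a permutation matrix.
    for _ in range(n_iters):
        x = x / (x.sum(dim=1, keepdim=True) + eps)
        x = x / (x.sum(dim=0, keepdim=True) + eps)
    return x

def unrolled_graph_matching(affinity, n_rows, n_cols, n_steps=5):
    # Relaxed quadratic-assignment solver, unrolled for a fixed number
    # of steps so gradients flow back into the affinity matrix (and
    # hence into the features that produced it).
    #
    # affinity: (n_rows*n_cols, n_rows*n_cols) non-negative affinity
    #           between candidate (proposal, track) correspondences.
    # Returns a soft (n_rows, n_cols) assignment matrix.
    n = n_rows * n_cols
    x = torch.full((n, 1), 1.0 / n,
                   dtype=affinity.dtype, device=affinity.device)
    for _ in range(n_steps):
        x = affinity @ x                  # ascent-style update on x^T A x
        x = x.clamp(min=0)                # keep the iterate feasible
        x = sinkhorn_normalize(x.view(n_rows, n_cols)).reshape(n, 1)
    return x.view(n_rows, n_cols)

# Toy usage: associate 3 per-frame proposals with 3 historical tracks.
aff = torch.rand(9, 9, requires_grad=True)
assignment = unrolled_graph_matching(aff, 3, 3)
assignment.sum().backward()               # gradients reach the affinity
```

In such a setup the affinity would typically be built from similarities between the temporally aggregated instance embeddings of the current and historical frames; at inference, a hard assignment can be recovered from the soft matrix (e.g. by Hungarian matching).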