A Graph Matching Perspective With Transformers on Video Instance Segmentation

Video Instance Segmentation (VIS) requires automatically tracking and segmenting multiple objects in a video, which relies on modeling the spatial-temporal interactions of the instances. This paper formulates VIS as a graph matching problem. Unlike the traditional tracking-by-detection paradigm or bottom-up generative solutions, we introduce a novel, learnable graph matching Transformer that predicts instances by learning their spatial-temporal relationships. Specifically, we take advantage of the powerful Transformer and exploit temporal feature aggregation to implicitly capture long-term temporal information across frames. Our model generates instance proposals per frame and associates them with historical frames via the proposed graph matching on the enhanced features. Furthermore, to make the whole network end-to-end differentiable, we relax the original graph matching problem into a continuous quadratic program and unroll its optimization into a deep graph network. Extensive experiments on two representative benchmarks, YouTube-VIS19 and OVIS, verify the effectiveness of our graph matching Transformer.
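To make the association step concrete, below is a minimal, illustrative PyTorch sketch of the kind of differentiable matching the abstract describes: the discrete graph matching between current-frame proposals and historical instances is relaxed to a continuous (approximately doubly-stochastic) assignment and solved with a few unrolled gradient-plus-Sinkhorn steps, so matching losses can back-propagate into the feature extractor. The function names (`sinkhorn`, `unrolled_matching`), the similarity-based affinities, and the hyperparameters (`tau`, `lam`, `n_steps`) are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def sinkhorn(scores, n_iters=10, tau=0.1):
    """Treat scores as logits and alternate row/column normalization in
    log space, softly projecting toward a doubly-stochastic matrix."""
    log_x = scores / tau
    for _ in range(n_iters):
        log_x = log_x - torch.logsumexp(log_x, dim=-1, keepdim=True)
        log_x = log_x - torch.logsumexp(log_x, dim=-2, keepdim=True)
    return log_x.exp()


def unrolled_matching(feat_cur, feat_hist, n_steps=5, step=0.5, lam=0.1):
    """Soft association between current-frame and historical instance
    embeddings via an unrolled scheme on a relaxed graph-matching objective.

    feat_cur:  (N, D) current-frame proposal embeddings
    feat_hist: (M, D) historical instance embeddings
    Returns an (N, M) soft assignment matrix; every step is differentiable.
    """
    feat_cur = F.normalize(feat_cur, dim=-1)
    feat_hist = F.normalize(feat_hist, dim=-1)
    unary = feat_cur @ feat_hist.t()       # node-to-node affinity, (N, M)
    s_cur = feat_cur @ feat_cur.t()        # intra-frame structure, (N, N)
    s_hist = feat_hist @ feat_hist.t()     # intra-memory structure, (M, M)

    x = torch.full_like(unary, 1.0 / unary.shape[-1])  # uniform init
    for _ in range(n_steps):
        # Gradient of <X, unary> + lam * tr(X^T S_cur X S_hist) w.r.t. X.
        grad = unary + 2.0 * lam * s_cur @ x @ s_hist
        # Ascent step, then a soft projection back toward the assignment set.
        x = sinkhorn(x + step * grad)
    return x
```

In practice one would threshold or take the row-wise argmax of the returned soft assignment at inference time, while training uses the soft matrix directly inside the matching loss.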

CVPR 2022
