To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario even with random sampled partial observations, e. g., only 25% ~ 50% of the input sequence.
Evaluation metrics in machine learning are often hardly taken as loss functions, as they could be non-differentiable and non-decomposable, e. g., average precision and F1 score.
Recently, query based deep networks catch lots of attention owing to their end-to-end pipeline and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation.
The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, as well as one-to-one correspondence between mask RoI features and object queries in the same stage.
Ranked #10 on Instance Segmentation on COCO test-dev
For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
Ranked #1 on Video Instance Segmentation on OVIS validation