We propose the Prototypical Cross-Attention Network (PCAN), which leverages rich spatio-temporal information for online multiple object tracking and segmentation.
Ranked #1 on Multi-Object Tracking and Segmentation on BDD100K
We recast multi-view stereo (MVS) as the feature matching task it fundamentally is, and therefore propose a powerful Feature Matching Transformer (FMT) that leverages intra- (self-) and inter- (cross-) attention to aggregate long-range context information within and across images.
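As a minimal sketch of the intra-/inter-attention idea (not the authors' FMT implementation; the `attention` helper and toy feature shapes are assumptions), self-attention lets tokens of one image attend to each other, while cross-attention uses one image's features as queries against another image's features:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(16, 32))  # toy features of image A: 16 tokens, dim 32
feat_b = rng.normal(size=(16, 32))  # toy features of image B

# Intra- (self-) attention: aggregate long-range context within image A.
self_out = attention(feat_a, feat_a, feat_a)
# Inter- (cross-) attention: image A queries context from image B.
cross_out = attention(feat_a, feat_b, feat_b)
```

In the self case, queries, keys, and values all come from the same image; in the cross case, only the queries do, so each output token is a context-weighted mixture of the other image's features.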
A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, followed by re-identification (re-ID) for object association.
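The association step of such a pipeline can be sketched as greedy matching on re-ID embedding similarity (a hypothetical helper, not any particular tracker's implementation; the `associate` name and the 0.5 threshold are assumptions):

```python
import numpy as np

def associate(track_embs, det_embs, thresh=0.5):
    """Greedily match each detection to the most similar unmatched track
    by cosine similarity of re-ID embeddings (hypothetical helper)."""
    matches = {}
    for j, d in enumerate(det_embs):
        sims = [(t @ d) / (np.linalg.norm(t) * np.linalg.norm(d) + 1e-8)
                for t in track_embs]
        i = int(np.argmax(sims))
        if sims[i] > thresh and i not in matches.values():
            matches[j] = i
    return matches  # detection index -> track index

rng = np.random.default_rng(1)
tracks = [rng.normal(size=8) for _ in range(3)]
# Detection 0 is a near-copy of track 2; detection 1 is unrelated noise.
dets = [tracks[2] + 0.01 * rng.normal(size=8), rng.normal(size=8)]
matches = associate(tracks, dets)
```

Real trackers typically replace this greedy loop with Hungarian matching and fuse appearance similarity with motion cues, but the detect-then-associate structure is the same.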
Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance.
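To illustrate the hypothesis that the surrounding block structure matters more than the specific token mixer, here is a minimal NumPy sketch (names `pool_mixer` and `metaformer_block` are hypothetical, and LayerNorm is omitted for brevity) in which a trivial pooling operation slots into the same residual template that self-attention normally occupies:

```python
import numpy as np

def pool_mixer(x, k=3):
    """Average-pool token mixer: each token mixes with its k neighbours.
    A deliberately simple stand-in for self-attention."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])

def metaformer_block(x, mixer):
    """Generic transformer-style block: token mixing + channel MLP,
    each wrapped in a residual connection."""
    x = x + mixer(x)                                  # token mixing
    w = np.ones((x.shape[1], x.shape[1])) / x.shape[1]  # toy MLP weight
    return x + np.tanh(x @ w)                         # channel MLP

tokens = np.random.default_rng(0).normal(size=(10, 16))
out = metaformer_block(tokens, pool_mixer)
```

Any mixer with the same signature (attention, pooling, an MLP over tokens) can be swapped into `metaformer_block` without touching the rest of the block.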
Ranked #41 on Semantic Segmentation on ADE20K
We introduce a robust, real-time, high-resolution human video matting method that achieves new state-of-the-art performance.
Ranked #1 on Video Matting on VideoMatte240K
Specifically, Mesa uses exact activations during the forward pass while storing only a low-precision version of the activations, reducing memory consumption during training.
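A minimal sketch of this exact-forward / low-precision-storage idea (not Mesa's actual scheme; the per-tensor affine quantizer and helper names here are assumptions) stores an 8-bit copy of the activation for the backward pass while the forward computation itself uses full precision:

```python
import numpy as np

def quantize(x):
    """Per-tensor affine quantization of activations to 8 bits."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct an approximate float activation from its 8-bit copy."""
    return q.astype(np.float32) * scale + lo

x = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
y = np.maximum(x, 0.0)      # forward pass: exact ReLU on exact activations
saved = quantize(x)         # only the 8-bit copy is kept for backward
# Backward pass: dequantize to approximate the activation for gradients.
q, scale, lo = saved
x_approx = dequantize(q, scale, lo)
grad_mask = (x_approx > 0).astype(np.float32)  # approximate ReLU gradient
```

The uint8 copy is a quarter the size of the float32 activation, which is where the training-memory saving comes from; the gradient is computed from the slightly lossy reconstruction.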
We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM.