Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches exploit the temporal dimension only to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose the Prototypical Cross-Attention Network (PCAN), which leverages rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from past frames. To segment each object, PCAN adopts a prototypical appearance module that learns a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both the YouTube-VIS and BDD100K datasets, and that it is effective with both one-stage and two-stage segmentation frameworks. Code and video resources are available at http://vis.xyz/pub/pcan.
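The two steps named in the abstract, distilling a space-time memory into prototypes and then reading from them with cross-attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the prototypes are computed, so the soft-clustering loop, the scaled dot-product attention, and all names here (`distill_prototypes`, `cross_attend`, `num_prototypes`, `temperature`) are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def distill_prototypes(memory, num_prototypes, num_iters=5, temperature=1.0, seed=0):
    """Condense memory features of shape (N, C) into (K, C) prototypes.

    Assumption: a simple soft-clustering loop (soft assignment followed by
    a weighted mean) stands in for whatever procedure the paper uses.
    """
    rng = np.random.default_rng(seed)
    # Initialise the prototypes from randomly chosen memory features.
    idx = rng.choice(memory.shape[0], size=num_prototypes, replace=False)
    prototypes = memory[idx].copy()
    for _ in range(num_iters):
        # Soft-assign every memory feature to the prototypes: (N, K).
        resp = softmax(memory @ prototypes.T / temperature, axis=1)
        # Normalise per prototype so each update is a weighted mean.
        weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-8)
        prototypes = weights.T @ memory  # (K, C)
    return prototypes


def cross_attend(queries, prototypes):
    """Read from the prototypes with scaled dot-product cross-attention.

    queries: (M, C) current-frame features. Returns the (M, C) readout
    and the (M, K) attention weights.
    """
    scale = 1.0 / np.sqrt(queries.shape[1])
    attn = softmax(queries @ prototypes.T * scale, axis=1)
    return attn @ prototypes, attn


# Toy data: 4 past frames of 64 locations each, 16 channels per feature.
rng = np.random.default_rng(0)
memory = rng.normal(size=(4 * 64, 16))
current = rng.normal(size=(64, 16))

prototypes = distill_prototypes(memory, num_prototypes=8)
readout, attn = cross_attend(current, prototypes)
```

In this toy setting each current-frame location attends over 8 prototypes rather than all 256 memory locations, which shows how a compact set of prototypes can reduce the cost of reading from past frames.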

NeurIPS 2021
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Instance Segmentation | BDD100K val | PCAN | mMOTSA | 27.4 | #1 |
| Multiple Object Track and Segmentation | BDD100K val | PCAN | mMOTSA | 27.4 | #1 |
| Multi-Object Tracking and Segmentation | BDD100K val | PCAN | mMOTSA | 27.4 | #3 |
| Multi-Object Tracking and Segmentation | BDD100K val | QDTrack-mots-fix | mMOTSA | 23.5 | #4 |
| Video Instance Segmentation | BDD100K val | QDTrack-mots-fix | mMOTSA | 23.5 | #2 |
| Multi-Object Tracking and Segmentation | BDD100K val | QDTrack-mots | mMOTSA | 22.5 | #5 |
| Video Instance Segmentation | BDD100K val | QDTrack-mots | mMOTSA | 22.5 | #3 |
| Multi-Object Tracking and Segmentation | BDD100K val | STEm-Seg | mMOTSA | 12.2 | #7 |
| Video Instance Segmentation | BDD100K val | STEm-Seg | mMOTSA | 12.2 | #5 |
| Video Instance Segmentation | BDD100K val | MaskTrackRCNN | mMOTSA | 12.3 | #4 |
| Multi-Object Tracking and Segmentation | BDD100K val | MaskTrackRCNN | mMOTSA | 12.3 | #6 |
| Multi-Object Tracking and Segmentation | BDD100K val | SortIoU | mMOTSA | 10.3 | #8 |
| Video Instance Segmentation | BDD100K val | SortIoU | mMOTSA | 10.3 | #6 |
| Video Instance Segmentation | YouTube-VIS validation | PCAN (ResNet-50) | mask AP | 36.1 | #39 |
| Video Instance Segmentation | YouTube-VIS validation | PCAN (ResNet-50) | AP50 | 54.9 | #40 |
| Video Instance Segmentation | YouTube-VIS validation | PCAN (ResNet-50) | AP75 | 39.4 | #33 |
| Video Instance Segmentation | YouTube-VIS validation | PCAN (ResNet-50) | AR1 | 36.3 | #31 |
| Video Instance Segmentation | YouTube-VIS validation | PCAN (ResNet-50) | AR10 | 41.6 | #33 |
