Contrastive Learning for Multi-Object Tracking with Transformers

The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous works typically add expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into a MOT model by employing an instance-level contrastive loss, a revised sampling strategy, and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities, with little overhead. Its performance surpasses the previous state of the art by +2.6 mMOTA on the challenging BDD100K dataset and is comparable to existing transformer-based methods on the MOT17 dataset.

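The abstract only summarizes the training scheme, so the following is a minimal sketch, assuming a supervised InfoNCE-style instance-level contrastive loss over DETR decoder-query embeddings. The function name, the temperature value, and the use of -1 as a background label are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the authors' released code): an instance-level
# contrastive loss over DETR decoder-query embeddings. Queries matched to
# the same ground-truth identity (e.g. the same object in adjacent frames
# or augmented views) are pulled together; all other queries are pushed apart.
import torch
import torch.nn.functional as F


def instance_contrastive_loss(embeddings: torch.Tensor,
                              instance_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """embeddings:   (N, D) query embeddings
       instance_ids: (N,)   integer identity per query, -1 for background
    """
    # Keep only queries matched to a ground-truth object.
    keep = instance_ids >= 0
    z = F.normalize(embeddings[keep], dim=1)
    ids = instance_ids[keep]
    n = z.size(0)

    # Cosine-similarity logits between all pairs of kept queries.
    logits = z @ z.t() / temperature
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    # Positive pairs share an instance id; self-pairs are excluded.
    pos_mask = (ids.unsqueeze(0) == ids.unsqueeze(1)) & ~eye

    # Row-wise log-softmax, ignoring the self-similarity term.
    logits = logits.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-probability of positives per anchor, then over anchors
    # that have at least one positive pair.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    if not valid.any():
        return z.sum() * 0.0  # no positive pairs in this batch
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()
```

At inference, the same embeddings can drive the lightweight assignment step described in the abstract, e.g. by matching detections to existing tracks via cosine similarity; the exact matching rule used by the paper is not specified here.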
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Multiple Object Tracking | BDD100K test | ContrasTR | mMOTA | 42.8 | #1 |
| Multiple Object Tracking | BDD100K test | ContrasTR | mHOTA | 46.1 | #2 |
| Multiple Object Tracking | BDD100K test | ContrasTR | mIDF1 | 56.5 | #2 |
| Multiple Object Tracking | BDD100K val | ContrasTR | mMOTA | 41.7 | #5 |
| Multiple Object Tracking | BDD100K val | ContrasTR | mIDF1 | 52.9 | #8 |
| Multiple Object Tracking | BDD100K val | ContrasTR | TETA | - | #5 |
| Multiple Object Tracking | BDD100K val | ContrasTR | AssocA | - | #5 |
| Multi-Object Tracking | MOT17 | ContrasTR | MOTA | 73.7 | #25 |
| Multi-Object Tracking | MOT17 | ContrasTR | IDF1 | 71.8 | #25 |
| Multi-Object Tracking | MOT17 | ContrasTR | HOTA | 58.9 | #25 |