Do Different Tracking Tasks Require Different Appearance Models?

Tracking objects of interest in a video is one of the most popular and widely applicable problems in computer vision. Over the years, however, a Cambrian explosion of use cases and benchmarks has fragmented the problem into a multitude of different experimental setups. As a consequence, the literature has fragmented too, and novel approaches proposed by the community are now usually specialised to fit only one specific setup. To understand to what extent this specialisation is necessary, in this work we present UniTrack, a solution that addresses five different tasks within the same framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned in a supervised or self-supervised fashion, and multiple "heads" that address individual tasks and do not require training. We show how most tracking tasks can be solved within this framework, and that the same appearance model can be used to obtain results that are competitive with specialised methods for most of the tasks considered. The framework also allows us to analyse appearance models obtained with the most recent self-supervised methods, thus extending their evaluation and comparison to a larger variety of important problems.
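The split between one shared appearance model and several training-free heads can be illustrated with a toy sketch. The abstract does not specify how the heads work, so everything below is an assumption made for illustration: the function names (`appearance_model`, `association_head`, `propagation_head`) are hypothetical, the "model" is a plain L2-normalised flattening rather than a learned network, and cosine similarity with greedy matching stands in for whatever the actual heads do.

```python
import numpy as np

def appearance_model(patch):
    """Hypothetical stand-in for a shared, task-agnostic appearance model.
    Assumption: flatten the patch and L2-normalise it (no learning involved)."""
    v = np.asarray(patch, dtype=float).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def association_head(track_feats, det_feats):
    """Training-free head (assumed): greedily pair tracks and detections
    in order of decreasing cosine similarity."""
    sim = np.asarray(track_feats) @ np.asarray(det_feats).T
    matches, used_tracks, used_dets = [], set(), set()
    for flat_idx in np.argsort(-sim, axis=None):
        t, d = (int(i) for i in np.unravel_index(flat_idx, sim.shape))
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return sorted(matches)

def propagation_head(ref_feats, ref_labels, query_feats):
    """Training-free head (assumed): give each query feature the label
    of its most similar reference feature."""
    sim = np.asarray(query_feats) @ np.asarray(ref_feats).T
    return [ref_labels[i] for i in sim.argmax(axis=1)]

# Both heads consume features from the same appearance model.
tracks = np.stack([appearance_model([1, 0, 0, 0]),
                   appearance_model([0, 1, 0, 0])])
detections = np.stack([appearance_model([0.1, 1, 0, 0]),
                       appearance_model([1, 0.1, 0, 0])])

matches = association_head(tracks, detections)
labels = propagation_head(tracks, ["person", "car"], detections)
```

In this sketch only `appearance_model` would ever be trained; the two heads have no parameters, which is the property the abstract attributes to the framework's heads.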

NeurIPS 2021

Results from the Paper

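| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 | UniTrack | mIoU | 58.4 | #2 |
| Pose Estimation | J-HMDB | UniTrack_i18 | Mean PCK@0.2 | 80.5 | #5 |
| Pose Estimation | J-HMDB | UniTrack_i18 | Mean PCK@0.1 | 58.3 | #3 |
| Multi-Object Tracking | MOT16 | UniTrack | MOTA | 74.7 | #6 |
| Multi-Object Tracking | MOT16 | UniTrack | IDF1 | 71.8 | #4 |
| Multi-Object Tracking | MOT16 | UniTrack | IDs | 683 | #2 |
| Multi-Object Tracking | MOTS20 | UniTrack | sMOTSA | 68.9 | #2 |
| Multi-Object Tracking | MOTS20 | UniTrack | IDF1 | 67.2 | #1 |
| Multi-Object Tracking | MOTS20 | UniTrack | IDs | 622 | #1 |
| Visual Object Tracking | OTB-2015 | UniTrack_DCF | AUC | 0.618 | #13 |
| Pose Tracking | PoseTrack2018 | UniTrack | MOTA | 63.5 | #2 |
| Pose Tracking | PoseTrack2018 | UniTrack | IDF1 | 73.2 | #2 |
| Pose Tracking | PoseTrack2018 | UniTrack | IDs | 6760 | #1 |
| Video Instance Segmentation | YouTube-VIS validation | UniTrack | mask AP | 30.1 | #50 |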

