BootsTAP: Bootstrapped Training for Tracking-Any-Point

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
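The core idea is that a teacher's pseudo-labels on raw, unlabeled video can supervise a student that sees a transformed copy of the same clip. The sketch below (PyTorch) illustrates one generic way such a student-teacher consistency step could be set up; the `tap_model(video, queries)` call signature, the (frame, x, y) query layout in normalized [-1, 1] coordinates, the rescale augmentation, and the EMA teacher update are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a student-teacher consistency step on unlabeled video.
# All interfaces here (tap_model, query layout, augmentation, EMA teacher)
# are assumptions for illustration, not BootsTAPIR's exact code.
import torch
import torch.nn.functional as F

def rescale_video(video, scale):
    """Globally rescale frame content: content at normalized location x in the
    input appears at x / scale in the output."""
    b, t, c, h, w = video.shape
    theta = video.new_tensor([[scale, 0.0, 0.0], [0.0, scale, 0.0]])
    theta = theta.expand(b * t, 2, 3)
    grid = F.affine_grid(theta, (b * t, c, h, w), align_corners=False)
    frames = F.grid_sample(video.reshape(b * t, c, h, w), grid, align_corners=False)
    return frames.reshape(b, t, c, h, w)

def self_supervised_step(student, teacher, video, queries, scale=1.25):
    """One bootstrapping step: the teacher pseudo-labels the raw clip, and the
    student must reproduce those labels on a spatially transformed copy."""
    with torch.no_grad():
        # Teacher pseudo-labels on the untouched video; tracks are (B, N, T, 2)
        # in normalized coordinates, visibility logits are (B, N, T).
        teacher_tracks, teacher_vis_logits = teacher(video, queries)
        teacher_visible = (torch.sigmoid(teacher_vis_logits) > 0.5).float()

    # The student sees a rescaled clip and correspondingly remapped queries.
    aug_video = rescale_video(video, scale)
    aug_queries = queries.clone()
    aug_queries[..., 1:] = queries[..., 1:] / scale  # (frame, x, y) queries
    student_tracks, student_vis_logits = student(aug_video, aug_queries)

    # Consistency: student tracks must match the teacher's after the same
    # coordinate remapping, restricted to points the teacher deems visible.
    mapped_teacher_tracks = teacher_tracks / scale
    per_point = F.huber_loss(student_tracks, mapped_teacher_tracks,
                             reduction="none").mean(-1)
    track_loss = (per_point * teacher_visible).sum() / teacher_visible.sum().clamp(min=1)
    vis_loss = F.binary_cross_entropy_with_logits(student_vis_logits, teacher_visible)
    return track_loss + vis_loss

def update_teacher(student, teacher, decay=0.999):
    """Keep the teacher as an exponential moving average of the student."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```

Weighting the track loss by the teacher's predicted visibility is one way to keep unreliable pseudo-labels on occluded points from dominating the consistency objective.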

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Point Tracking | TAP-Vid-DAVIS | BootsTAPIR | Average Jaccard | 66.2 | #2 |
| Point Tracking | TAP-Vid-DAVIS | BootsTAPIR | Average PCK | 78.1 | #3 |
| Point Tracking | TAP-Vid-DAVIS | BootsTAPIR | Occlusion Accuracy | 91 | #1 |
| Point Tracking | TAP-Vid-Kinetics | BootsTAPIR | Average Jaccard | 61.4 | #1 |
| Point Tracking | TAP-Vid-Kinetics | BootsTAPIR | Occlusion Accuracy | 89.7 | #1 |
| Point Tracking | TAP-Vid-Kinetics | BootsTAPIR | Average PCK | 74.2 | #1 |
| Point Tracking | TAP-Vid-RGB-Stacking | BootsTAPIR | Average Jaccard | 72.4 | #1 |
| Point Tracking | TAP-Vid-RGB-Stacking | BootsTAPIR | Average PCK | 83.1 | #2 |
| Point Tracking | TAP-Vid-RGB-Stacking | BootsTAPIR | Occlusion Accuracy | 91.2 | #1 |
