Tracking by Natural Language Specification

This paper strives to track a target object in a video. Rather than specifying the target in the first frame of a video by a bounding box, we propose to track the object based on a natural language specification of the target, which provides a more natural human-machine interaction as well as a means to improve tracking results. We define three variants of tracking by language specification: one relying on lingual target specification only, one relying on visual target specification based on language, and one leveraging their joint capacity. To show the potential of tracking by natural language specification we extend two popular tracking datasets with lingual descriptions and report experiments. Finally, we also sketch new tracking scenarios in surveillance and other live video streams that become feasible with a lingual specification of the target.

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | J-HMDB | Li et al. | Precision@0.5 | 0.578 | #20 |
| | | | Precision@0.6 | 0.335 | #21 |
| | | | Precision@0.7 | 0.103 | #20 |
| | | | Precision@0.8 | 0.060 | #13 |
| | | | Precision@0.9 | 0.000 | #11 |
| | | | AP | 0.173 | #17 |
| | | | IoU overall | 0.529 | #20 |
| | | | IoU mean | 0.491 | #20 |
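
The page lists the metrics above without defining them. The sketch below shows how Precision@K, overall IoU, and mean IoU are commonly computed for referring expression segmentation. The definitions are assumptions rather than something stated on this page: Precision@K is taken as the fraction of samples whose IoU exceeds K, overall IoU as total intersection over total union across all samples, and mean IoU as the average of per-sample IoUs. The function names and the flattened-mask representation are hypothetical, chosen only for illustration. AP is left out because its exact averaging scheme is not given here.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: each mask is a flattened sequence of 0/1 values, 1 = target pixel.
Mask = Sequence[int]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(
    preds: List[Mask],
    gts: List[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute Precision@K, overall IoU, and mean IoU over a set of samples."""
    counts = [intersection_union(p, g) for p, g in zip(preds, gts)]
    # Per-sample IoU; an empty union is scored as 0.0 (an assumption).
    ious = [i / u if u else 0.0 for i, u in counts]
    n = len(ious)

    results: Dict[str, float] = {}
    for t in thresholds:
        # Assumption: strict inequality; some evaluation scripts may use >=.
        results[f"Precision@{t}"] = sum(iou > t for iou in ious) / n

    total_inter = sum(i for i, _ in counts)
    total_union = sum(u for _, u in counts)
    results["IoU overall"] = total_inter / total_union if total_union else 0.0
    results["IoU mean"] = sum(ious) / n
    return results


if __name__ == "__main__":
    # Two toy samples, each a flattened 2x2 mask.
    preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
    gts = [[1, 1, 1, 0], [0, 1, 0, 0]]
    for name, value in evaluate(preds, gts).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, overall IoU weights large objects more heavily, while mean IoU treats every sample equally. That difference is one plausible reason the two values in the table diverge.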

Results from Other Papers


| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | A2D Sentences | Li et al. | Precision@0.5 | 0.387 | #26 |
| | | | Precision@0.9 | 0.001 | #26 |
| | | | IoU overall | 0.515 | #26 |
| | | | IoU mean | 0.354 | #26 |
| | | | Precision@0.6 | 0.290 | #25 |
| | | | Precision@0.7 | 0.175 | #25 |
| | | | Precision@0.8 | 0.066 | #25 |
| | | | AP | 0.163 | #21 |
