Cross-Attention Guided Network for Visual Tracking
Most traditional Siamese trackers process the features of the template branch and the search branch separately and then cross-correlate them, taking the location of the maximum response as the target location. However, because they ignore the interaction between the target template and the search frame, it is difficult for these methods to suppress the influence of similar objects and background clutter. Accordingly, this paper proposes a novel cross-attention guided network (SiamCAN), which combines cross-channel attention and self-spatial attention to learn discriminative object representations and address this problem. Specifically, the channel attention of the target template is introduced to guide feature learning in the search branch, and self-spatial attention is then used to localize the informative parts after correlation. Moreover, to obtain more accurate target estimation, an anchor-free mechanism and a distance-IoU (DIoU) loss are applied during training to minimize the distance between the center points of the ground-truth and predicted bounding boxes. The proposed method achieves state-of-the-art performance on four visual tracking benchmarks, including UAV123, OTB100, VOT2018 and VOT2019, outperforming the strong baseline SiamRPN++ by raising EAO from 0.417 to 0.445 on VOT2018 and from 0.292 to 0.323 on VOT2019.
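The DIoU loss mentioned above augments the standard IoU term with a penalty on the normalized distance between box centers. A minimal sketch follows, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the function name and box layout are illustrative choices, not taken from the paper's implementation.

```python
def diou_loss(pred, gt):
    """Distance-IoU loss: 1 - IoU + d^2(centers) / c^2(enclosing-box diagonal).

    `pred` and `gt` are (x1, y1, x2, y2) tuples with x2 > x1, y2 > y1.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection-over-union term.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)

    # Squared distance between the two box centers.
    pcx, pcy = (px1 + px2) / 2.0, (py1 + py2) / 2.0
    gcx, gcy = (gx1 + gx2) / 2.0, (gy1 + gy2) / 2.0
    center_dist2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes,
    # used to normalize the center-distance penalty.
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2

    return 1.0 - iou + center_dist2 / diag2
```

Unlike plain IoU loss, the center-distance term keeps a useful gradient even when the predicted and ground-truth boxes do not overlap, which is what lets training pull the predicted center toward the ground-truth center directly.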