Visual TransforMatcher: Efficient Match-to-Match Attention for Visual Correspondence

29 Sep 2021 · Seung Wook Kim, Juhong Min, Minsu Cho

Establishing correspondences between images remains a challenging task, especially under large appearance changes due to different viewpoints and intra-class variations. In this work, we introduce a strong image matching learner, dubbed Visual TransforMatcher, which builds on the success of Transformers in vision domains. Unlike previous self-attention schemes over image matches, it performs match-to-match attention for precise match localization and dynamically updates matching scores in a global context. To handle the large number of candidate matches in a dense correlation map, we develop a lightweight architecture with an effective positional encoding technique for matching. In experiments, our method achieves a new state of the art on the SPair-71k dataset, while performing on par with existing state-of-the-art models on the PF-PASCAL and PF-WILLOW datasets, showing the effectiveness of the proposed approach. We also provide the results of extensive ablation studies to justify the design choices of our model. The code and trained weights will be released upon acceptance.
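
To make the idea of match-to-match attention concrete, below is a minimal PyTorch sketch: every candidate match in a 4D correlation map between two images is treated as a token, given an embedding of its 4D position, and refined by self-attention over all matches. This is only an illustration of the general technique under stated assumptions, not the authors' architecture; the module name, layer sizes, and the simple linear positional encoding are all hypothetical.

```python
# Minimal sketch of match-to-match attention over a dense correlation map.
# NOT the paper's implementation; all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class MatchToMatchAttention(nn.Module):
    """Treat every candidate match in a 4D correlation map as a token and
    refine its matching score with self-attention over all matches."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.score_embed = nn.Linear(1, dim)   # lift scalar matching scores to tokens
        self.pos_embed = nn.Linear(4, dim)     # encode the 4D match position (ys, xs, yt, xt)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score_head = nn.Linear(dim, 1)    # project refined tokens back to scalar scores

    def forward(self, corr):
        # corr: (B, Hs, Ws, Ht, Wt) dense correlation map between source and target images
        B, Hs, Ws, Ht, Wt = corr.shape
        scores = corr.reshape(B, -1, 1)        # (B, N, 1) with N = Hs*Ws*Ht*Wt matches

        # 4D coordinates of every match, normalized to [0, 1]
        grids = torch.meshgrid(
            torch.linspace(0, 1, Hs), torch.linspace(0, 1, Ws),
            torch.linspace(0, 1, Ht), torch.linspace(0, 1, Wt), indexing="ij")
        pos = torch.stack(grids, dim=-1).reshape(1, -1, 4).to(corr)  # (1, N, 4)

        tokens = self.score_embed(scores) + self.pos_embed(pos)
        refined, _ = self.attn(tokens, tokens, tokens)   # match-to-match attention
        refined = self.norm(tokens + refined)
        return self.score_head(refined).reshape(B, Hs, Ws, Ht, Wt)


# Usage on a small correlation map; full-resolution maps would need a lighter design,
# which is precisely the efficiency problem the paper targets.
corr = torch.randn(1, 8, 8, 8, 8)
refined_corr = MatchToMatchAttention()(corr)
print(refined_corr.shape)  # torch.Size([1, 8, 8, 8, 8])
```

Note that naive attention over all matches scales quadratically in the number of matches, which is why handling dense correlation maps requires the kind of lightweight design the abstract mentions.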
