Optimal Transport Aggregation for Visual Place Recognition

27 Nov 2023 · Sergio Izquierdo, Javier Civera

The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
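The assignment step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (the official code is in the repository linked above); it is a minimal example under stated assumptions: scores are plain cosine similarities to hypothetical cluster centers rather than the output of a learned layer, the dustbin score is a fixed constant rather than a learned parameter, each feature carries unit mass, each cluster receives unit mass, and the dustbin absorbs the remaining n - m units. The function names `sinkhorn_with_dustbin` and `aggregate` are illustrative only.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def sinkhorn_with_dustbin(scores, dustbin_score=1.0, num_iters=100):
    """Log-domain Sinkhorn over an (n features x m clusters) score matrix.

    Assumption: a 'dustbin' column with a constant score is appended, features
    have unit mass, clusters have unit mass, and the dustbin takes n - m.
    Returns the (n, m + 1) transport plan.
    """
    n, m = scores.shape
    assert n > m, "this sketch assumes more features than clusters"
    S = np.concatenate([scores, np.full((n, 1), dustbin_score)], axis=1)
    log_mu = np.zeros(n)                                     # row marginals: 1 per feature
    log_nu = np.concatenate([np.zeros(m), [np.log(n - m)]])  # column marginals
    u = np.zeros(n)
    v = np.zeros(m + 1)
    for _ in range(num_iters):
        # Alternate row and column normalisation in log space.
        u = log_mu - logsumexp(S + v[None, :], axis=1)
        v = log_nu - logsumexp(S + u[:, None], axis=0)
    return np.exp(S + u[:, None] + v[None, :])

def aggregate(features, plan):
    """Drop the dustbin column, sum features per cluster, L2-normalise."""
    m = plan.shape[1] - 1
    per_cluster = plan[:, :m].T @ features   # (m, d) weighted sums
    desc = per_cluster.reshape(-1)
    return desc / np.linalg.norm(desc)

# Toy usage with random unit-norm features and hypothetical cluster centers.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
centers = rng.normal(size=(4, 8))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
plan = sinkhorn_with_dustbin(feats @ centers.T)
descriptor = aggregate(feats, plan)
```

Because rows and columns are normalised alternately, the plan reflects both feature-to-cluster and cluster-to-feature relations, and any mass that lands in the last column is left out of the final descriptor.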

Results from the Paper


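Task                      Dataset               Model                             Metric Name  Metric Value  Global Rank
Visual Place Recognition  Mapillary test        DINOv2 SALAD                      Recall@1     75            #1
Visual Place Recognition  Mapillary test        DINOv2 SALAD                      Recall@5     88.8          #1
Visual Place Recognition  Mapillary test        DINOv2 SALAD                      Recall@10    91.3          #1
Visual Place Recognition  Mapillary val         DINOv2 SALAD                      Recall@1     92.2          #1
Visual Place Recognition  Mapillary val         DINOv2 SALAD                      Recall@5     96.4          #1
Visual Place Recognition  Mapillary val         DINOv2 SALAD                      Recall@10    97            #2
Visual Place Recognition  Nordland              DINOv2 SALAD (1-frame threshold)  Recall@1     85.2          #2
Visual Place Recognition  Nordland              DINOv2 SALAD (1-frame threshold)  Recall@5     98.5          #1
Visual Place Recognition  Nordland              DINOv2 SALAD (1-frame threshold)  Recall@10    95.5          #2
Visual Place Recognition  Pittsburgh-250k-test  DINOv2 SALAD                      Recall@1     95.1          #2
Visual Place Recognition  Pittsburgh-250k-test  DINOv2 SALAD                      Recall@5     98.5          #2
Visual Place Recognition  Pittsburgh-250k-test  DINOv2 SALAD                      Recall@10    99.1          #1
Visual Place Recognition  SPED                  DINOv2 SALAD                      Recall@1     92.1          #1
Visual Place Recognition  SPED                  DINOv2 SALAD                      Recall@5     96.2          #1
Visual Place Recognition  SPED                  DINOv2 SALAD                      Recall@10    96.5          #1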
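
For readers unfamiliar with the metric, Recall@K in retrieval-based place recognition counts a query as correct when at least one of its K nearest database descriptors is a true match. The sketch below shows that computation on toy data. The function name `recall_at_k`, the use of dot-product similarity, and the `positives` ground-truth format are assumptions made for illustration; each benchmark defines its own criterion for what counts as a true match.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, positives, ks=(1, 5, 10)):
    """Percentage of queries with at least one true match in the top-k.

    query_desc: (Q, d) array, db_desc: (N, d) array of global descriptors.
    positives:  list of length Q; positives[q] holds the database indices
                that count as correct for query q (assumed format).
    """
    sims = query_desc @ db_desc.T                 # similarity of every query to every reference
    ranked = np.argsort(-sims, axis=1)[:, :max(ks)]  # best matches first
    recalls = {}
    for k in ks:
        hits = [np.isin(ranked[q, :k], positives[q]).any()
                for q in range(len(positives))]
        recalls[k] = 100.0 * float(np.mean(hits))
    return recalls

# Toy example: three unit-norm references, two queries.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
ground_truth = [[2], [1]]
result = recall_at_k(queries, db, ground_truth, ks=(1, 2))
```

In this toy case the first query ranks its true match second and the second query ranks it first, so Recall@1 is 50 and Recall@2 is 100.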

Methods