Scalable Nearest Neighbor Search for Optimal Transport

ICML 2020  ·  Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner ·

The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search with respect to this distance, a problem that poses a substantial computational bottleneck for various tasks on massive datasets. In this work, we study fast tree-based approximation algorithms for searching nearest neighbors w.r.t. the Wasserstein-1 distance. A standard tree-based technique, known as Quadtree, has been previously shown to obtain good results. We introduce a variant of this algorithm, called Flowtree, and formally prove it achieves asymptotically better accuracy. Our extensive experiments, on real-world text and image datasets, show that Flowtree improves over various baselines and existing methods in either running time or accuracy. In particular, its quality of approximation is in line with previous high-accuracy methods, while its running time is much faster.

PDF Abstract ICML 2020 PDF

Categories


Data Structures and Algorithms

Datasets


  Add Datasets introduced or used in this paper