An Efficient Protocol for Distributed Column Subset Selection in the Entrywise $\ell_p$ Norm

1 Jan 2021 · Shuli Jiang, Dongyu Li, Irene Mengze Li, Arvind V. Mahankali, David Woodruff

We give a distributed protocol with nearly-optimal communication and number of rounds for Column Subset Selection with respect to the entrywise $\ell_1$ norm ($k$-CSS$_1$), and more generally, for the $\ell_p$ norm with $1 \leq p < 2$. We study matrix factorization under $\ell_1$-norm loss, rather than the more standard Frobenius-norm loss, because the $\ell_1$ norm is more robust to noise. This loss function arises naturally in a wide range of computer vision and robotics problems, such as 3D reconstruction and structure-from-motion. In the distributed setting, we consider $s$ servers in the standard coordinator model of communication, where the columns of the input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$) are distributed across the $s$ servers. We give a protocol in this model with $\tilde{O}(sdk)$ communication, $1$ round, and polynomial running time, that achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\,\mathrm{poly}(\log nd)$-approximation to the best possible column subset. A key ingredient in our proof is a reduction to the $\ell_{p,2}$ norm, which is the $p$-norm of the vector of Euclidean norms of the columns of $A$. This reduction enables us to use strong coreset constructions for Euclidean norms, which had not previously been used in this context. It also naturally allows us to implement our algorithm in the popular streaming model of computation. We further propose a greedy algorithm for selecting columns, which can be used by the coordinator, and show the first provable guarantees for a greedy algorithm for the $\ell_{1,2}$ norm. Finally, we implement our protocol and demonstrate significant practical advantages on real data sets.
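
To make the $\ell_{p,2}$ norm concrete, the following is a minimal Python sketch that computes it alongside the entrywise $\ell_p$ norm for a small matrix. It is an illustration only, not the paper's implementation: the function names `entrywise_lp_norm` and `lp2_norm`, the row-list matrix representation, and the example matrix are all assumptions made here. The final check uses the standard vector-norm inequality $\|v\|_2 \leq \|v\|_p \leq d^{\frac{1}{p} - \frac{1}{2}} \|v\|_2$ for $v \in \mathbb{R}^d$ and $1 \leq p \leq 2$, applied column by column; it is not claimed to be the argument used in the paper.

```python
import math

def entrywise_lp_norm(A, p):
    """Entrywise l_p norm: (sum over all entries |A_ij|^p)^(1/p).

    A is a d x n matrix given as a list of d rows (an assumed representation).
    """
    return sum(abs(x) ** p for row in A for x in row) ** (1.0 / p)

def lp2_norm(A, p):
    """l_{p,2} norm: the p-norm of the vector of Euclidean column norms."""
    # zip(*A) iterates over the columns of the row-list matrix.
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*A)]
    return sum(c ** p for c in col_norms) ** (1.0 / p)

# Hypothetical 2 x 2 example with columns (3, 4) and (0, 1).
A = [[3.0, 0.0],
     [4.0, 1.0]]
d = len(A)
p = 1.0

entrywise = entrywise_lp_norm(A, p)  # |3| + |0| + |4| + |1|
mixed = lp2_norm(A, p)               # ||(3, 4)||_2 + ||(0, 1)||_2

# Column-wise norm equivalence: the two norms differ by at most d^(1/p - 1/2).
assert mixed <= entrywise <= d ** (1.0 / p - 0.5) * mixed
```

For $p = 1$ the gap between the two norms in this sketch is at most $\sqrt{d}$, which gives some intuition for why working with Euclidean column norms can be a useful proxy for the entrywise loss.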
