We study the classic $k$-means/median clustering, which are fundamental
problems in unsupervised learning, in the setting where data are partitioned
across multiple sites, and where we are allowed to discard a small portion of
the data by labeling them as outliers. We propose a simple approach based on
constructing small summary for the original dataset...
The proposed method is
time and communication efficient, has good approximation guarantees, and can
identify the global outliers effectively. To the best of our knowledge, this is
the first practical algorithm with theoretical guarantees for distributed
clustering with outliers. Our experiments on both real and synthetic data have
demonstrated the clear superiority of our algorithm against all the baseline
algorithms in almost all metrics.