Explainable k-Means and k-Medians Clustering

Clustering is a popular unsupervised learning method for geometric data. Unfortunately, many clustering algorithms depend on global properties of the data, so there is no simple explanation for why a point is assigned to a given cluster. To improve interpretability, we consider using a small threshold tree to partition a dataset into clusters, so that every cluster assignment can be explained by a small number of single-feature threshold comparisons. We study this problem from a theoretical viewpoint, measuring output quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and we prove that any explainable clustering must in general incur an Ω(log k) approximation factor compared to the optimal clustering. On the upper bound side, we design efficient algorithms that produce explainable clusters using a tree with k leaves. For two means/medians, we show that, surprisingly, a single threshold cut suffices to achieve a constant-factor approximation, nearly matching our lower bounds. For general k ≥ 2, our algorithm is an O(k) approximation to the optimal k-medians and an O(k²) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of the dimensionality and input size.
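The k = 2 case is simple to illustrate: since the explainable clustering is induced by a single axis-aligned threshold cut, one can search over all (feature, threshold) pairs and keep the cut with the lowest cost. The sketch below does this for the 2-means objective by brute force; it is an illustration of the threshold-cut idea, not the paper's exact procedure, and the function name and exhaustive search are our own choices.

```python
# Minimal sketch (assumption: small dataset, brute-force search is acceptable).
# For k = 2, try every axis-aligned cut (feature j, threshold t) and keep the
# one minimizing the 2-means cost when each side is assigned its own mean.
import numpy as np

def best_threshold_cut(X):
    """Return (feature, threshold, cost) of the cheapest single-feature cut."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        # Cut between consecutive distinct values so both sides are nonempty.
        for t in np.unique(X[:, j])[:-1]:
            left, right = X[X[:, j] <= t], X[X[:, j] > t]
            cost = (((left - left.mean(axis=0)) ** 2).sum()
                    + ((right - right.mean(axis=0)) ** 2).sum())
            if cost < best[2]:
                best = (j, t, cost)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated Gaussian blobs in 3 dimensions.
    X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
    j, t, cost = best_threshold_cut(X)
    print(f"cut on feature {j} at threshold {t:.2f}, cost {cost:.2f}")
```

The resulting cluster assignment is explainable by construction: a point's cluster is determined entirely by whether its value in one feature is above or below a single threshold.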

ICML 2020