Clustering

Large-scale spectral clustering

Introduced by Li et al. in Divide-and-conquer based Large-Scale Spectral Clustering

Spectral Clustering

Spectral clustering aims to partition the data points into $k$ clusters using the spectrum of the graph Laplacian. Given a dataset $X$ with $N$ data points, a spectral clustering algorithm first constructs a similarity matrix $W$, where $w_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ under some similarity measure.

Let $L=D-W$, where $L$ is the graph Laplacian and $D$ is a diagonal matrix with $d_{ii} = \sum_{j=1}^N w_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows: \begin{equation} \label{eq:SC_obj} \min_{U} \operatorname{tr}\left(U^{T} L U\right), \quad \text{s.t.} \quad U^{T} U = I, \end{equation} where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U$ are the low-dimensional embeddings of the original data points. Generally, spectral clustering computes $U$ as the bottom $k$ eigenvectors of $L$, and finally applies $k$-means on the rows of $U$ to obtain the clustering result.
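
As a concrete illustration of this pipeline, here is a minimal sketch in Python, assuming a Gaussian (RBF) similarity with bandwidth `sigma` as the similarity measure (one common choice; the formulation above does not fix the metric):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Minimal spectral clustering sketch: Gaussian similarity,
    unnormalized Laplacian L = D - W, bottom-k eigenvectors, k-means."""
    # Similarity matrix W via a Gaussian kernel (an assumed choice of metric).
    W = np.exp(-cdist(X, X, metric="sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Graph Laplacian L = D - W, with D the diagonal degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Spectral embedding U: eigenvectors of the k smallest eigenvalues of L
    # (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]

    # Final clustering: k-means on the rows of U.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```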

Large-scale Spectral Clustering

To capture the relationships among all data points in $X$, conventional spectral clustering must construct an $N\times N$ similarity matrix, which costs $O(N^2 d)$ time and $O(N^2)$ memory and is infeasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by its cross-similarity to a set of representative data points (i.e., landmarks) under some similarity measure, as \begin{equation} \label{eq:cross-similarity} B = \Phi(X,R), \end{equation} where $R = \{r_1, r_2, \dots, r_p\}$ ($p \ll N$) is a set of landmarks with the same dimension as $X$, $\Phi(\cdot)$ denotes a similarity measure, and $B \in \mathbb{R}^{N\times p}$ is the similarity sub-matrix that represents $X \in \mathbb{R}^{N\times d}$ with respect to $R \in \mathbb{R}^{p\times d}$.
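
As a sketch of how $B$ might be obtained in practice, assuming (for illustration only) landmarks chosen as $k$-means centers and a Gaussian kernel as $\Phi$; both are common choices rather than requirements of the formulation:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def cross_similarity(X, p, sigma=1.0, random_state=0):
    """Sketch of B = Phi(X, R): p landmarks as k-means centers
    (random sampling is a cheaper alternative), Gaussian similarity."""
    # Landmarks R in R^{p x d}: p cluster centers of X, with p << N.
    R = KMeans(n_clusters=p, n_init=10, random_state=random_state).fit(X).cluster_centers_

    # Cross-similarity sub-matrix B in R^{N x p}.
    B = np.exp(-cdist(X, R, metric="sqeuclidean") / (2 * sigma ** 2))
    return B, R
```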

For large-scale spectral clustering with such a sub-matrix, a symmetric similarity matrix $W$ can be designed as \begin{equation} \label{eq:WusedB} W=\left[\begin{array}{cc} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right]. \end{equation} The size of $W$ is $(N+p)\times (N+p)$. Taking advantage of this bipartite structure, fast eigen-decomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering result.
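
One such fast method, sketched below under the assumption that the normalized Laplacian of $W$ is used: because $W$ is bipartite, the spectral embedding of the $N$ data points can be read off from a truncated SVD of the degree-normalized $B$, avoiding any eigen-decomposition of the full $(N+p)\times(N+p)$ matrix:

```python
import numpy as np

def bipartite_spectral_embedding(B, k, eps=1e-12):
    """Embedding of the N data points from W = [[0, B], [B^T, 0]]
    via SVD of the degree-normalized sub-matrix (a sketch)."""
    # Degrees: row sums of B for the N data points, column sums for the p landmarks.
    d1 = B.sum(axis=1) + eps
    d2 = B.sum(axis=0) + eps

    # Normalized sub-matrix D1^{-1/2} B D2^{-1/2}; eigenvectors of the
    # normalized bipartite Laplacian are built from its singular vectors.
    B_norm = B / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]

    # Top-k left singular vectors (SVD returns singular values in
    # descending order) give the embedding of the N data points.
    U, _, _ = np.linalg.svd(B_norm, full_matrices=False)
    return U[:, :k]

# Final step, as in the text: k-means on the embedding rows, e.g.
# labels = sklearn.cluster.KMeans(n_clusters=k, n_init=10).fit_predict(
#     bipartite_spectral_embedding(B, k))
```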

The clustering result depends directly on the quality of $B$, which consists of the similarities between data points and landmarks. Thus, landmark selection is crucial to the clustering result.

Source: Divide-and-conquer based Large-Scale Spectral Clustering

Tasks


Task Papers Share
Clustering 5 41.67%
Image/Document Clustering 3 25.00%
Semantic Segmentation 2 16.67%
Incremental Learning 1 8.33%
Image Clustering 1 8.33%

Components


Component            Type
Spectral Clustering  Clustering
