A parallel sampling based clustering

5 Dec 2014 · Aditya AV Sastry, Kalyan Netti

Automatically clustering data is an age-old problem, and numerous algorithms have been created to tackle it. The execution time of any of these algorithms grows with the number of input points and the number of cluster centers required. To reduce the number of input points, we could average the points locally and use the means, or local centers, as the input for clustering. However, since the required number of local centers is very high, running the clustering algorithm on the entire dataset to obtain these representative points is very time consuming. To remedy this problem, in this paper we propose two subclustering schemes whereby we subdivide the dataset into smaller sets and run the clustering algorithm on each smaller set to obtain the reduced set of data points on which the final clustering is run. Because the dataset is subdivided, the clustering algorithm can be run on each piece in parallel. We found both parallel and serial execution of this method to be much faster than the original clustering algorithm, and the error introduced by running the clustering algorithm on a reduced set to be very small.
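The abstract does not spell out the two subclustering schemes, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes the dataset is split into random, roughly equal chunks and that plain k-means (Lloyd's algorithm) is the base clustering algorithm. The names `kmeans`, `subcluster`, `n_chunks`, `local_k`, `final_k`, and `map_fn` are hypothetical and chosen for illustration.

```python
import random
from functools import partial


def _dist2(p, q):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns a list of k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _dist2(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center when a group ends up empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def subcluster(points, n_chunks, local_k, final_k, seed=0, map_fn=map):
    """Cluster each chunk separately, then cluster the pooled local centers.

    Assumption (not from the paper): chunks are formed by shuffling the
    data and dealing it out round-robin.
    """
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    # Each chunk is independent, so map_fn may be a parallel map.
    local = map_fn(partial(kmeans, k=local_k, seed=seed), chunks)
    reduced = [c for centers in local for c in centers]
    # The final clustering sees only n_chunks * local_k points.
    return kmeans(reduced, final_k, seed=seed)
```

With the default `map_fn=map` the chunks are processed serially. To process them in parallel, one option is to pass the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn`, created under an `if __name__ == "__main__":` guard. The final call to `kmeans` then runs on `n_chunks * local_k` local centers instead of the full dataset, which is where the reduction in input size comes from.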

