2 code implementations • 7 Sep 2023 • Lars Lenssen, Erich Schubert
We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, provide two fast versions for the direct optimization, and discuss its use for choosing the optimal number of clusters.
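As a rough illustration of the medoid-based variant: the Silhouette of a point simplifies to 1 - d1/d2, where d1 and d2 are the distances to its nearest and second-nearest medoid. A minimal sketch (the function name, API, and toy data are hypothetical, not the paper's implementation):

```python
import numpy as np

def medoid_silhouette(X, medoids):
    """Average medoid Silhouette: per point, 1 - d1/d2, where d1 and d2
    are the distances to the nearest and second-nearest medoid.
    (Illustrative sketch; name and API are assumptions.)"""
    # Distance from every point to every medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    d.sort(axis=1)  # d[:, 0] = nearest medoid, d[:, 1] = second nearest
    with np.errstate(divide="ignore", invalid="ignore"):
        s = np.where(d[:, 1] > 0, 1.0 - d[:, 0] / d[:, 1], 0.0)
    return s.mean()

# Two tight, well-separated pairs with medoids at indices 0 and 2
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(medoid_silhouette(X, [0, 2]))  # close to 1: well-separated clusters
```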
no code implementations • 5 Sep 2023 • Lars Lenssen, Erich Schubert
FastPAM recently introduced a speedup for large k that makes it applicable to larger problems, but the method's runtime is still quadratic in N. In this chapter, we discuss a sparse and asymmetric variant of this problem, to be used, for example, on graph data such as road networks.
1 code implementation • 5 Sep 2023 • Erich Schubert, Andreas Lang
Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method: it can be used with many distances and similarities, and with various linkage strategies.
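To illustrate this flexibility, a minimal sketch using SciPy's standard HAC implementation: the same precomputed distances can be combined with several linkage strategies (the toy data and parameters here are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two well-separated Gaussian blobs (illustrative toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
d = pdist(X)  # condensed pairwise distances; other metrics could be used

# The same distance matrix works with different linkage strategies
for method in ("single", "complete", "average", "ward"):
    Z = linkage(d, method=method)                    # (n-1) x 4 merge steps
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.unique(labels).size)
```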
no code implementations • 27 Dec 2022 • Daniel Boiar, Thomas Liebig, Erich Schubert
Support Vector Machines have been successfully used for one-class classification (OCSVM, SVDD) when trained on clean data, but they work much worse on dirty data: outliers present in the training data tend to become support vectors, and are hence considered "normal".
no code implementations • 23 Dec 2022 • Erich Schubert
A major challenge when using k-means clustering is choosing the parameter k, the number of clusters.
2 code implementations • 26 Sep 2022 • Lars Lenssen, Erich Schubert
One of the versions guarantees results equal to the original variant and provides a runtime speedup of $O(k^2)$.
1 code implementation • 26 Sep 2022 • Erik Thordsen, Erich Schubert
The merit of projecting data onto linear subspaces is well known from, e.g., dimension reduction.
1 code implementation • Lernen, Wissen, Daten, Analysen 2021 • Erich Schubert
Unfortunately, we also show that the requirement to produce a hierarchical result is a limiting factor to the cluster quality, as the optimum result for a particular number of clusters k does not have to be consistent with the optimum result for k+1 clusters.
no code implementations • 14 Jul 2021 • Erik Thordsen, Erich Schubert
Many approaches in the field of machine learning and data analysis rely on the assumption that the observed data lies on lower-dimensional manifolds.
1 code implementation • 8 Jul 2021 • Erich Schubert, Andreas Lang, Gloria Feher
Spherical k-means is a widely used clustering algorithm for sparse and high-dimensional data such as document vectors.
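As background, a minimal spherical k-means sketch: both points and centers are kept on the unit sphere, and assignment maximizes the dot product (cosine similarity). All names and data here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def spherical_kmeans(X, k, iters=50, seed=0):
    """Illustrative sketch of spherical k-means: points and centers live
    on the unit sphere, and assignment maximizes cosine similarity."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project to unit sphere
    C = X[rng.choice(len(X), k, replace=False)]       # random initial centers
    for _ in range(iters):
        labels = (X @ C.T).argmax(axis=1)             # most similar center
        for j in range(k):
            m = X[labels == j].sum(axis=0)
            n = np.linalg.norm(m)
            if n > 0:
                C[j] = m / n                          # re-normalized mean
    return labels, C

# Tiny document-vector-like example (illustrative)
docs = np.array([[2.0, 0.0, 0.0],
                 [1.0, 0.1, 0.0],
                 [0.0, 0.0, 3.0],
                 [0.0, 0.1, 1.0]])
labels, C = spherical_kmeans(docs, k=2)
print(labels)
```

The re-normalization step is the key difference from ordinary k-means: the center is the mean direction, not the mean point.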
1 code implementation • 8 Jul 2021 • Erich Schubert
In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); we show that this bound is tight and discuss fast approximations for it.
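The angular reasoning behind such a bound can be sketched as follows: writing sim = cos θ, the triangle inequality on angles gives θ(A,C) ≤ θ(A,B) + θ(B,C), and the identity cos(α+β) = cos α cos β − sin α sin β then yields a lower bound on sim(A,C). A small numeric check of this simplified form (not necessarily the paper's exact formulation):

```python
import numpy as np

def cos_lower_bound(s_ab, s_bc):
    """Lower bound on sim(A, C) given sim(A, B) and sim(B, C), via
    cos(a + b) = cos(a)cos(b) - sin(a)sin(b) applied to the angular
    triangle inequality (sketch of the angular argument)."""
    return s_ab * s_bc - np.sqrt((1 - s_ab**2) * (1 - s_bc**2))

rng = np.random.default_rng(1)
for _ in range(1000):
    A, B, C = rng.normal(size=(3, 5))
    A, B, C = [v / np.linalg.norm(v) for v in (A, B, C)]  # unit vectors
    s_ab, s_bc, s_ac = A @ B, B @ C, A @ C
    assert s_ac >= cos_lower_bound(s_ab, s_bc) - 1e-12
print("lower bound held on 1000 random unit-vector triples")
```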
3 code implementations • 12 Aug 2020 • Erich Schubert, Peter J. Rousseeuw
While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches for using PAM and CLARA on big data (some of which use PAM as a subroutine and hence benefit immediately from these improvements), where performance with high k becomes increasingly important.
1 code implementation • 23 Jun 2020 • Erik Thordsen, Erich Schubert
In this paper, we introduce an orthogonal concept which does not use any distances: we use the distribution of angles between neighbor points.
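A simplified sketch in this spirit (not the paper's exact estimator): for isotropic directions in R^d, the expected squared cosine between two random unit vectors is 1/d, so the reciprocal of the mean squared cosine over neighbor directions estimates the local dimensionality. All names and data below are illustrative assumptions:

```python
import numpy as np

def angle_based_id(X, point, k=20):
    """Illustrative sketch: estimate intrinsic dimensionality at one point
    from the angles between the directions to its k nearest neighbors.
    For isotropic d-dimensional directions, E[cos^2] = 1/d, so
    1 / mean(cos^2) estimates d. (Simplified, not the paper's estimator.)"""
    diffs = X - X[point]
    dist = np.linalg.norm(diffs, axis=1)
    nn = np.argsort(dist)[1:k + 1]          # skip the point itself
    V = diffs[nn] / dist[nn, None]          # unit direction vectors
    cos = V @ V.T
    cos2 = cos[np.triu_indices(k, 1)] ** 2  # pairwise squared cosines
    return 1.0 / cos2.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # truly 3-dimensional data
print(round(angle_based_id(X, 0, k=50), 1))
```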
1 code implementation • 23 Jun 2020 • Andreas Lang, Erich Schubert
We introduce a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and which makes many computations simpler and hence more efficient.
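One way such a numerically stable cluster feature can be sketched: store the count, mean, and sum of squared deviations rather than the classic (N, LS, SS) triple, updating them incrementally in Welford style. The class and API below are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

class StableCF:
    """Sketch of a numerically stable cluster feature: count, mean, and
    sum of squared deviations instead of linear and squared sums.
    (Welford-style updates; illustrative, not the paper's exact API.)"""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.ssd = np.zeros(dim)  # per-dimension sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean           # deviation from the old mean
        self.mean += delta / self.n
        self.ssd += delta * (x - self.mean)  # uses old and new mean

    def variance(self):
        return self.ssd / self.n

# The naive SS - LS^2/N suffers catastrophic cancellation for large
# offsets; the deviation-based feature does not:
cf = StableCF(1)
for v in (1e8 + 1, 1e8 + 2, 1e8 + 3):
    cf.add(np.array([v]))
print(cf.variance())  # variance of {1, 2, 3}, unaffected by the 1e8 offset
```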
1 code implementation • 10 Feb 2019 • Erich Schubert, Arthur Zimek
We will first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version.
4 code implementations • 12 Oct 2018 • Erich Schubert, Peter J. Rousseeuw
It can easily be combined with earlier approaches for using PAM and CLARA on big data (some of which use PAM as a subroutine and hence benefit immediately from these improvements), where performance with high k becomes increasingly important.
3 code implementations • Lernen, Wissen, Daten, Analysen 2018 • Erich Schubert, Michael Gertz
Density-based clustering is closely associated with the two algorithms DBSCAN and OPTICS.
1 code implementation • 11 Aug 2017 • Erich Schubert, Andreas Spitz, Michael Weiler, Johanna Geiß, Michael Gertz
We then select keywords based on their significance and construct the word cloud based on the derived affinity.
1 code implementation • VLDB 2015 • Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, Arthur Zimek
The challenges associated with handling uncertain data, in particular with querying and mining, are attracting increasing attention in the research community.