Coherence-Based Document Clustering

29 Sep 2021 · Anton Frederik Thielmann, Christoph Weisser, Thomas Kneib, Benjamin Saefken

Latent Dirichlet Allocation and Non-negative Matrix Factorization are two widely used algorithms for extracting latent topics from large text corpora. Although these algorithms differ in their modeling approach, they share a drawback: hyperparameter optimization is difficult and is mainly achieved by maximizing the coherence scores of the extracted topics via a grid search. Models using word-document embeddings can automatically detect the number of latent topics, but they tend to struggle with smaller datasets and often require pre-trained embedding layers for successful topic extraction. We leverage widely used coherence scores by integrating them into a novel document-level clustering approach that uses keyword extraction methods. The metric by which most topic extraction methods optimize their hyperparameters is thus optimized during clustering itself, resulting in ultra-coherent clusters. Moreover, unlike traditional methods, the number of extracted topics or clusters does not need to be determined in advance, which saves an additional optimization step and a time- and computationally-intensive grid search. Additionally, the number of topics is detected much more accurately than by models leveraging word-document embeddings.
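To make the notion of a coherence score concrete, the following is a minimal sketch of one common variant, normalized pointwise mutual information (NPMI) averaged over keyword pairs and estimated from document-level co-occurrence. It is an illustration only and is not taken from the paper: the abstract does not state which coherence measure is used, and the function name `npmi_coherence`, the toy corpus `docs`, and the handling of the boundary cases are assumptions.

```python
import math
from itertools import combinations


def npmi_coherence(keywords, documents):
    """Mean NPMI over all keyword pairs, using document co-occurrence counts.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(keywords, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # assumed convention: never co-occur -> -1
        elif p12 == 1.0:
            scores.append(1.0)   # assumed convention: avoids dividing by log(1) = 0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    # With fewer than two keywords there are no pairs to score.
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus of tokenized documents (made up for this example).
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]

coherent = npmi_coherence(["cat", "dog"], docs)   # close to 1.0: always co-occur
mixed = npmi_coherence(["cat", "stock"], docs)    # -1.0: never co-occur
```

A clustering procedure that optimizes coherence directly would prefer groupings whose extracted keywords score like `coherent` over groupings that score like `mixed`, which is the intuition behind optimizing this metric during clustering rather than in a separate grid search.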
