# An Analysis of Lemmatization on Topic Models of Morphologically Rich Language

13 Aug 2016

Topic models are typically represented by top-$m$ word lists for human interpretation. The corpus is often pre-processed with lemmatization (or stemming) so that those representations are not undermined by a proliferation of words with similar meanings, but there is little public work on the effects of that pre-processing. Recent work studied the effect of stemming on topic models of English texts and found no supporting evidence for the practice. We study the effect of lemmatization on topic models of Russian Wikipedia articles, finding in one configuration that it significantly improves interpretability according to a word intrusion metric. We conclude that lemmatization may benefit topic models on morphologically rich languages, but that further investigation is needed.
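To make the motivation concrete, the following minimal sketch shows how inflected forms of one word can crowd a top-$m$ list, and how collapsing them to a lemma frees space for other words. Everything in it is an illustrative assumption rather than the paper's method: the `TOY_LEMMAS` table, the topic weights, and the helper names are hypothetical, and a real pipeline would lemmatize the corpus with a morphological analyzer before training instead of merging weights afterwards.

```python
from collections import Counter

# Hypothetical toy lemma table (inflected Russian form -> lemma).
# A real pipeline would use a morphological analyzer instead.
TOY_LEMMAS = {
    "книги": "книга",
    "книгу": "книга",
    "городе": "город",
}


def lemmatize(token):
    """Return the lemma for a token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def top_m(weights, m):
    """Return the m highest-weighted words, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:m]]


def merge_by_lemma(weights):
    """Sum the weights of all surface forms that share a lemma."""
    merged = Counter()
    for word, weight in weights.items():
        merged[lemmatize(word)] += weight
    return dict(merged)


# Made-up topic-word weights for a single topic (illustration only).
topic = {
    "книга": 0.20, "книги": 0.18, "книгу": 0.15,
    "город": 0.12, "городе": 0.10, "автор": 0.09, "река": 0.05,
}

raw_top = top_m(topic, 3)
lemma_top = top_m(merge_by_lemma(topic), 3)
print(raw_top)    # three inflections of the same word
print(lemma_top)  # three distinct lemmas
```

In this toy case the unlemmatized list is filled by three forms of "книга" (book), while the lemmatized list shows three different words. Whether that change improves interpretability in practice is the empirical question the paper examines.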
