Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

8 Apr 2020  ·  Federico Bianchi, Silvia Terragni, Dirk Hovy ·

Topic models extract meaningful groups of words from documents, allowing for a better understanding of data. However, the solutions are often not coherent enough, and thus harder to interpret... Coherence can be improved by adding more contextual knowledge to the model. Recently, neural topic models have become available, while BERT-based representations have further pushed the state of the art of neural models in general. We combine pre-trained representations and neural topic models. Pre-trained BERT sentence embeddings indeed support the generation of more meaningful and coherent topics than either standard LDA or existing neural topic models. Results on four datasets show that our approach effectively increases topic coherence. read more

PDF Abstract


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.