Topic modeling topic coverage dataset

Introduced by Korenčić et al. in A Topic Coverage Approach to Evaluation of Topic Models

A prevalent use case of topic models is topic discovery. However, most topic model evaluation methods rely on abstract metrics such as perplexity or topic coherence. The topic coverage approach instead measures a model's performance by matching model-generated topics to topics discovered by humans. This way, models are evaluated in the context of their use, essentially by simulating topic modeling in a fixed setting defined by a text collection and a set of reference topics.
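The matching idea can be illustrated with a minimal sketch. It assumes that topics are represented as word-weight vectors over a shared vocabulary and that a reference topic counts as covered when some model topic lies within a cosine-distance threshold. The function names, the toy vectors, and the threshold values are hypothetical and chosen for illustration only; the coverage measures actually proposed are defined in the accompanying paper and may differ from this simplification.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length word-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def coverage(reference_topics, model_topics, threshold):
    """Fraction of reference topics matched by at least one model topic.

    A reference topic is 'covered' if some model topic is within
    `threshold` cosine distance of it (an assumed matching rule).
    """
    if not reference_topics:
        return 0.0
    covered = 0
    for ref in reference_topics:
        if any(cosine_distance(ref, m) <= threshold for m in model_topics):
            covered += 1
    return covered / len(reference_topics)


# Toy example over a four-word vocabulary (made-up weights).
reference_topics = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
model_topics = [
    [0.6, 0.4, 0.0, 0.0],      # close to the first reference topic
    [0.25, 0.25, 0.25, 0.25],  # diffuse topic, only loosely related
]

strict_score = coverage(reference_topics, model_topics, threshold=0.1)
loose_score = coverage(reference_topics, model_topics, threshold=0.3)
print(strict_score, loose_score)
```

With the strict threshold only the first reference topic is matched, while the looser threshold also accepts the diffuse model topic as a match for the second one. This shows how strongly the score depends on the matching criterion.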

Reference topics represent a ground truth that can be used to evaluate both topic models and other measures of model performance. The coverage approach enables large-scale automatic evaluation of both existing and future topic models.

The topic coverage dataset consists of two text collections and two sets of reference topics. These two sub-datasets correspond to two domains (news text and biological text) where topic models are used for topic discovery in large text collections. The reference topics consist of model-generated topics inspected, selected, and curated by humans.

Each sub-dataset contains a corpus of preprocessed (tokenized) texts and a set of reference topics, each represented by a list of words and a list of text documents. The dataset details, including instructions for using the data and the supporting code, are here:

The coverage measures that can be used to evaluate topic models are described in the accompanying paper, whereas the code and the instructions can be found in the GitHub repository.

