Topic modeling topic coverage dataset

Introduced by Korenčić et al. in A Topic Coverage Approach to Evaluation of Topic Models

A prevalent use case of topic models is topic discovery. However, most topic model evaluation methods rely on abstract metrics such as perplexity or topic coherence. The topic coverage approach instead measures a model's performance by matching model-generated topics to topics discovered by humans. This way, the models are evaluated in the context of their use, by essentially simulating topic modeling in a fixed setting defined by a text collection and a set of reference topics.

Reference topics represent a ground truth that can be used to evaluate both topic models and other measures of model performance. The coverage approach enables large-scale automatic evaluation of both existing and future topic models.

The topic coverage dataset consists of two text collections and two sets of reference topics. These two sub-datasets correspond to two domains (news text and biological text) where topic models are used for topic discovery in large text collections. The reference topics consist of model-generated topics inspected, selected, and curated by humans.

Each sub-dataset contains a corpus of preprocessed (tokenized) texts and a set of reference topics, each represented by a list of words and text documents. The dataset details, including instructions for using the data and the supporting code, are available at: https://github.com/dkorenci/topic_coverage/blob/main/data.readme.txt

The coverage measures that can be used to evaluate topic models are described in the accompanying paper; the code and usage instructions can be found in the GitHub repository.
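As an illustration of the general idea (not the exact measures defined in the paper), coverage can be sketched as the fraction of reference topics matched by at least one model-generated topic, with topics compared as word lists. The similarity function and threshold below are assumptions chosen for simplicity:

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two topics given as lists of top words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def coverage(model_topics, reference_topics, threshold=0.2):
    """Fraction of reference topics matched by at least one model topic.

    A reference topic counts as covered if some model topic's word-set
    similarity to it meets the (hypothetical) threshold.
    """
    matched = sum(
        1 for ref in reference_topics
        if any(jaccard(ref, mod) >= threshold for mod in model_topics)
    )
    return matched / len(reference_topics)

# Toy usage: one of the two reference topics is covered.
model = [["game", "team", "season", "coach"], ["election", "vote", "party"]]
reference = [["game", "team", "win"], ["gene", "protein", "cell"]]
print(coverage(model, reference))  # 0.5
```

The actual measures in the paper operate on richer topic representations (word and document distributions); this sketch only conveys the match-and-count structure of the coverage approach.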
