Pre-training Meets Clustering: A Hybrid Extractive Multi-document Summarization Model

In an era when the Internet is flooded with information, manually extracting and consuming relevant information is difficult and time-consuming. An automated document summarization tool is therefore needed to extract the important information from a set of documents on similar or related subjects. Multi-document summarization retrieves important and relevant content from multiple documents while minimizing redundancy. This study develops a multi-document text summarization system using an unsupervised extractive approach. The proposed model fuses two learning paradigms: the pre-trained T5 transformer model and the K-Means clustering algorithm. Experiments are performed on the benchmark news-article corpus of the Document Understanding Conference (DUC2004), and the ROUGE evaluation metrics are used to measure the performance of the proposed approach. The results show that the proposed model substantially outperforms existing unsupervised state-of-the-art approaches.
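The abstract does not say how sentence representations and clusters are combined, so the following is only a minimal sketch of one common cluster-then-select scheme: embed every sentence, run K-Means, and keep the sentence closest to each centroid. Everything here is an assumption made for illustration. In particular, `embed` is a bag-of-words stand-in, not T5, and `kmeans` and `summarize` are hypothetical helper names; a real system would replace `embed` with vectors taken from the pre-trained model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(sentence, vocab):
    # ASSUMPTION: stand-in embedding (L2-normalised bag of words), NOT T5.
    counts = Counter(tokenize(sentence))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    # Index of the nearest centroid for every vector.
    return [min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors]


def kmeans(vectors, k, iters=20):
    # Lloyd's algorithm with deterministic farthest-point initialisation.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(vectors, centroids)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(vectors, centroids)


def summarize(sentences, k):
    # Pick, per cluster, the sentence nearest the centroid; keep source order.
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    vectors = [embed(s, vocab) for s in sentences]
    centroids, labels = kmeans(vectors, k)
    chosen = set()
    for j, c in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.add(min(members, key=lambda i: sq_dist(vectors[i], c)))
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The storm flooded the coastal town overnight.",
    "Flooded streets in the coastal town closed after the storm.",
    "The central bank raised interest rates again.",
    "Interest rates were raised by the central bank on Monday.",
]
summary = summarize(sentences, k=2)
```

Selecting one sentence per cluster is what limits redundancy in this sketch: sentences that say roughly the same thing fall into the same cluster, and only one of them is kept.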

Results from the Paper


Task: Extractive Text Summarization
Dataset: DUC 2004
Model: Pre-training-meets-Clustering-A-Hybrid-Extractive-Multi-Document-Summarization-Model
Test ROUGE-1: 34.013 (global rank # 1)
Test ROUGE-2: 8.266 (global rank # 1)
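For readers unfamiliar with the metric, ROUGE-N scores a candidate summary by its n-gram overlap with a reference summary. The sketch below computes the recall form with clipped counts. It is an illustration only: the listing does not state which ROUGE variant (recall, precision, or F-score) or which toolkit settings produced the reported values, and `ngrams` and `rouge_n_recall` are hypothetical helper names, not part of any official ROUGE package.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    # ASSUMPTION: whitespace tokenisation, lowercasing, no stemming.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's match count at its frequency in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 reference unigrams match
r2 = rouge_n_recall(candidate, reference, 2)  # 3 of 5 reference bigrams match
```

Scores such as those listed above are conventionally reported multiplied by 100.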
