GenCompareSum: a hybrid unsupervised summarization method using salience

Text summarization (TS) is an important NLP task. Pre-trained Language Models (PLMs) have been used to improve the performance of TS. However, PLMs are limited by their need for labelled training data and by their attention mechanism, which often makes them unsuitable for use on long documents. To address these limitations, we propose a hybrid, unsupervised, abstractive-extractive approach, in which we walk through a document, generating salient textual fragments representing its key points. We then select the most important sentences of the document by choosing the sentences most similar to the generated texts, with similarity calculated using BERTScore. We evaluate the efficacy of generating and using salient textual fragments to guide extractive summarization on documents from the biomedical and general scientific domains. We compare performance on long and short documents using different generative text models, which are finetuned to generate relevant queries or document titles. We show that our hybrid approach outperforms existing unsupervised methods, as well as state-of-the-art supervised methods, despite not needing a vast amount of labelled training data.
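The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the generated fragments are passed in as a precomputed list rather than produced by a finetuned generative model, a simple token-overlap F1 stands in for BERTScore, and scoring each sentence by summing its similarity over all fragments is an assumption. The names `token_f1` and `select_sentences` are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, List, Sequence


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1. A lightweight stand-in for BERTScore (assumption)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_sentences(
    sentences: Sequence[str],
    fragments: Sequence[str],
    k: int,
    similarity: Callable[[str, str], float] = token_f1,
) -> List[str]:
    """Return the k sentences most similar to the generated fragments.

    Each sentence is scored by summing its similarity to every fragment
    (an assumed aggregation); the selected sentences are returned in
    their original document order.
    """
    scores = [sum(similarity(s, f) for f in fragments) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    # Toy document and hand-written fragments standing in for generated text.
    document = [
        "Coronaviruses spread through respiratory droplets.",
        "The weather was sunny on the day of the study.",
        "Vaccines reduce the spread of coronaviruses.",
        "We thank the reviewers.",
    ]
    generated = ["how do coronaviruses spread", "do vaccines reduce spread"]
    for sentence in select_sentences(document, generated, k=2):
        print(sentence)
```

Because `similarity` is a parameter, a BERTScore-based function could be substituted for `token_f1` without changing the selection logic.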


Results from the Paper


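Task                Dataset  Model          Metric   Value  Global Rank
Text Summarization  arXiv    GenCompareSum  ROUGE-1  39.96  #26
Text Summarization  arXiv    GenCompareSum  ROUGE-2  15.15  #21
Text Summarization  arXiv    GenCompareSum  ROUGE-L  36.19  #18
Text Summarization  CORD-19  GenCompareSum  ROUGE-1  41.02  #1
Text Summarization  CORD-19  GenCompareSum  ROUGE-2  13.79  #1
Text Summarization  CORD-19  GenCompareSum  ROUGE-L  37.25  #1
Text Summarization  Pubmed   GenCompareSum  ROUGE-1  42.10  #23
Text Summarization  Pubmed   GenCompareSum  ROUGE-2  16.51  #19
Text Summarization  Pubmed   GenCompareSum  ROUGE-L  38.25  #16
Text Summarization  S2ORC    GenCompareSum  ROUGE-1  43.39  #1
Text Summarization  S2ORC    GenCompareSum  ROUGE-2  16.84  #1
Text Summarization  S2ORC    GenCompareSum  ROUGE-L  39.82  #1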
