UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

28 May 2021 · Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang

Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
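To make the first stage concrete, the following is a toy sketch of document-level silver-label mining: word sequences that consistently co-occur within a single document are kept as candidate phrase spans, and only maximal sequences survive. This is a hedged simplification of UCPhrase's core-phrase mining, not the paper's exact algorithm (the paper applies additional filters such as stopword handling); the function name and thresholds here are illustrative.

```python
from collections import Counter

def mine_silver_labels(sentences, min_freq=3, max_len=4):
    """Toy silver-label miner (illustrative, not the paper's exact method).

    Counts word n-grams (length 2..max_len) across the sentences of one
    document and keeps those appearing at least min_freq times; then
    retains only maximal n-grams, i.e. drops any frequent n-gram that is
    contained in a longer frequent n-gram.
    """
    counts = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    frequent = {g for g, c in counts.items() if c >= min_freq}

    def contained(g, h):
        # True if g is a strict contiguous sub-sequence of h
        return len(g) < len(h) and any(
            h[i:i + len(g)] == g for i in range(len(h) - len(g) + 1)
        )

    maximal = {g for g in frequent if not any(contained(g, h) for h in frequent)}
    return {" ".join(g) for g in maximal}
```

For example, a document that repeatedly mentions "neural language model" yields that full span as a single silver label rather than its sub-phrases; in UCPhrase, such spans are then paired with transformer attention maps to train the span classifier.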




Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Phrase Ranking | KP20k | Wiki+RoBERTa | P@5K | 100.0 | # 1 |
| Phrase Ranking | KP20k | Wiki+RoBERTa | P@50K | 98.5 | # 1 |
| Phrase Ranking | KP20k | UCPhrase | P@5K | 96.5 | # 2 |
| Phrase Ranking | KP20k | UCPhrase | P@50K | 96.5 | # 2 |
| Phrase Ranking | KP20k | TopMine | P@5K | 81.5 | # 3 |
| Phrase Ranking | KP20k | TopMine | P@50K | 78.0 | # 3 |
| Keyphrase Extraction | KP20k | Wiki+RoBERTa | Recall | 73.0 | # 1 |
| Keyphrase Extraction | KP20k | Wiki+RoBERTa | F1@10 | 19.2 | # 2 |
| Keyphrase Extraction | KP20k | UCPhrase | Recall | 72.9 | # 2 |
| Keyphrase Extraction | KP20k | UCPhrase | F1@10 | 19.7 | # 1 |
| Keyphrase Extraction | KP20k | AutoPhrase | Recall | 62.9 | # 3 |
| Keyphrase Extraction | KP20k | AutoPhrase | F1@10 | 18.2 | # 3 |
| Keyphrase Extraction | KP20k | Spacy | Recall | 59.5 | # 4 |
| Keyphrase Extraction | KP20k | Spacy | F1@10 | 15.3 | # 4 |
| Keyphrase Extraction | KP20k | PKE | Recall | 57.1 | # 5 |
| Keyphrase Extraction | KP20k | PKE | F1@10 | 12.6 | # 7 |
| Keyphrase Extraction | KP20k | TopMine | Recall | 53.3 | # 6 |
| Keyphrase Extraction | KP20k | TopMine | F1@10 | 15.0 | # 5 |
| Keyphrase Extraction | KP20k | StanfordNLP | Recall | 51.7 | # 7 |
| Keyphrase Extraction | KP20k | StanfordNLP | F1@10 | 13.9 | # 6 |
| Phrase Tagging | KP20k | UCPhrase | Precision | 69.9 | # 1 |
| Phrase Tagging | KP20k | UCPhrase | Recall | 78.3 | # 1 |
| Phrase Tagging | KP20k | UCPhrase | F1 | 73.9 | # 1 |
| Phrase Tagging | KP20k | Wiki+RoBERTa | Precision | 58.1 | # 2 |
| Phrase Tagging | KP20k | Wiki+RoBERTa | Recall | 64.2 | # 2 |
| Phrase Tagging | KP20k | Wiki+RoBERTa | F1 | 61.0 | # 2 |
| Phrase Tagging | KP20k | AutoPhrase | Precision | 55.2 | # 3 |
| Phrase Tagging | KP20k | AutoPhrase | Recall | 45.2 | # 3 |
| Phrase Tagging | KP20k | AutoPhrase | F1 | 49.7 | # 3 |
| Phrase Tagging | KP20k | TopMine | Precision | 39.8 | # 4 |
| Phrase Tagging | KP20k | TopMine | Recall | 41.4 | # 4 |
| Phrase Tagging | KP20k | TopMine | F1 | 40.6 | # 4 |
| Phrase Ranking | KPTimes | Wiki+RoBERTa | P@5K | 99.0 | # 1 |
| Phrase Ranking | KPTimes | Wiki+RoBERTa | P@50K | 96.5 | # 1 |
| Phrase Ranking | KPTimes | AutoPhrase | P@5K | 96.5 | # 2 |
| Phrase Ranking | KPTimes | AutoPhrase | P@50K | 95.5 | # 2 |
| Phrase Ranking | KPTimes | UCPhrase | P@5K | 96.5 | # 2 |
| Phrase Ranking | KPTimes | UCPhrase | P@50K | 95.5 | # 2 |
| Phrase Ranking | KPTimes | TopMine | P@5K | 85.5 | # 4 |
| Phrase Ranking | KPTimes | TopMine | P@50K | 71.0 | # 4 |
| Keyphrase Extraction | KPTimes | UCPhrase | Recall | 83.4 | # 1 |
| Keyphrase Extraction | KPTimes | UCPhrase | F1@10 | 10.9 | # 1 |
| Keyphrase Extraction | KPTimes | AutoPhrase | Recall | 77.8 | # 2 |
| Keyphrase Extraction | KPTimes | AutoPhrase | F1@10 | 10.3 | # 2 |
| Keyphrase Extraction | KPTimes | Wiki+RoBERTa | Recall | 64.5 | # 3 |
| Keyphrase Extraction | KPTimes | Wiki+RoBERTa | F1@10 | 9.4 | # 3 |
| Keyphrase Extraction | KPTimes | TopMine | Recall | 63.4 | # 4 |
| Keyphrase Extraction | KPTimes | TopMine | F1@10 | 8.5 | # 4 |
| Phrase Tagging | KPTimes | UCPhrase | Precision | 69.1 | # 1 |
| Phrase Tagging | KPTimes | UCPhrase | Recall | 78.9 | # 1 |
| Phrase Tagging | KPTimes | UCPhrase | F1 | 73.5 | # 1 |
| Phrase Tagging | KPTimes | Wiki+RoBERTa | Precision | 60.9 | # 2 |
| Phrase Tagging | KPTimes | Wiki+RoBERTa | Recall | 65.6 | # 2 |
| Phrase Tagging | KPTimes | Wiki+RoBERTa | F1 | 63.2 | # 2 |
| Phrase Tagging | KPTimes | AutoPhrase | Precision | 44.2 | # 3 |
| Phrase Tagging | KPTimes | AutoPhrase | Recall | 47.7 | # 3 |
| Phrase Tagging | KPTimes | AutoPhrase | F1 | 45.9 | # 3 |
| Phrase Tagging | KPTimes | TopMine | Precision | 32.0 | # 4 |
| Phrase Tagging | KPTimes | TopMine | Recall | 36.3 | # 4 |
| Phrase Tagging | KPTimes | TopMine | F1 | 34.0 | # 4 |
