LinkBERT: Pretraining Language Models with Document Links

ACL 2022 · Michihiro Yasunaga, Jure Leskovec, Percy Liang

Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.
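To make the input construction concrete, here is a minimal sketch of how segment pairs for document relation prediction might be built from a linked corpus. It assumes three relation labels (contiguous, random, linked) and uniform sampling among the available options; the abstract only states that linked documents are placed in the same context, so those details, along with the names `make_pair`, `to_model_input`, and the toy corpus layout, are illustrative assumptions and not the released implementation.

```python
import random
from dataclasses import dataclass

# Assumed label set for document relation prediction (DRP) in this sketch.
RELATIONS = ("contiguous", "random", "linked")


@dataclass
class Example:
    segment_a: str
    segment_b: str
    relation: str


def make_pair(corpus, links, doc_id, seg_idx, rng):
    """Pick a second segment for corpus[doc_id][seg_idx] and record its relation.

    corpus: dict mapping doc id -> list of text segments (hypothetical layout)
    links:  dict mapping doc id -> list of linked doc ids (e.g. hyperlinks)
    """
    segment_a = corpus[doc_id][seg_idx]
    linked_docs = links.get(doc_id, [])
    # "Random" candidates: any other document that is not linked from doc_id.
    unlinked_docs = [d for d in corpus if d != doc_id and d not in linked_docs]

    options = []
    if seg_idx + 1 < len(corpus[doc_id]):
        options.append("contiguous")
    if unlinked_docs:
        options.append("random")
    if linked_docs:
        options.append("linked")
    if not options:
        raise ValueError("no second segment is available for this anchor")

    relation = rng.choice(options)
    if relation == "contiguous":
        segment_b = corpus[doc_id][seg_idx + 1]
    elif relation == "linked":
        segment_b = rng.choice(corpus[rng.choice(linked_docs)])
    else:
        segment_b = rng.choice(corpus[rng.choice(unlinked_docs)])
    return Example(segment_a, segment_b, relation)


def to_model_input(example):
    """Format a pair BERT-style and return (text, DRP label id)."""
    text = f"[CLS] {example.segment_a} [SEP] {example.segment_b} [SEP]"
    return text, RELATIONS.index(example.relation)
```

In a setup like this, masked language modeling would be applied to the concatenated text while the label id supervises a classifier on the `[CLS]` position, so both objectives share one input.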


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Named Entity Recognition (NER) | BC2GM | BioLinkBERT (large) | F1 | 85.18 | #8 |
| Named Entity Recognition (NER) | BC5CDR | BioLinkBERT (large) | F1 | 90.22 | #6 |
| Named Entity Recognition (NER) | BC5CDR-chemical | BioLinkBERT (large) | F1 | 94.04 | #9 |
| Named Entity Recognition (NER) | BC5CDR-disease | BioLinkBERT (large) | F1 | 86.39 | #5 |
| Question Answering | BioASQ | BioLinkBERT (large) | Accuracy | 94.8 | #1 |
| Question Answering | BioASQ | BioLinkBERT (base) | Accuracy | 91.4 | #3 |
| Sentence Similarity | BIOSSES | BioLinkBERT (base) | Pearson Correlation | 93.25 | #2 |
| Semantic Similarity | BIOSSES | BioLinkBERT (large) | Pearson Correlation | 0.9363 | #1 |
| Semantic Similarity | BIOSSES | BioLinkBERT (base) | Pearson Correlation | 0.9325 | #2 |
| Sentence Similarity | BIOSSES | BioLinkBERT (large) | Pearson Correlation | 93.63 | #1 |
| Text Classification | BLURB | BioLinkBERT (base) | F1 | 84.35 | #2 |
| Question Answering | BLURB | BioLinkBERT (large) | Accuracy | 83.5 | #1 |
| Text Classification | BLURB | BioLinkBERT (large) | F1 | 84.88 | #1 |
| Question Answering | BLURB | BioLinkBERT (base) | Accuracy | 80.81 | #2 |
| Relation Extraction | ChemProt | BioLinkBERT (large) | F1 | 79.98 | #3 |
| Relation Extraction | ChemProt | BioLinkBERT (large) | Micro F1 | 79.98 | #1 |
| Relation Extraction | DDI | BioLinkBERT (large) | Micro F1 | 83.35 | #1 |
| Relation Extraction | DDI | BioLinkBERT (large) | F1 | 83.35 | #1 |
| Medical Relation Extraction | DDI extraction 2013 corpus | BioLinkBERT (large) | F1 | 83.35 | #1 |
| PICO | EBM PICO | BioLinkBERT (large) | Macro F1 word level | 74.19 | #1 |
| PICO | EBM PICO | BioLinkBERT (base) | Macro F1 word level | 73.97 | #2 |
| Relation Extraction | GAD | BioLinkBERT (large) | Micro F1 | 84.90 | #1 |
| Relation Extraction | GAD | BioLinkBERT (large) | F1 | 84.90 | #1 |
| Document Classification | HOC | BioLinkBERT (large) | F1 | 88.1 | #1 |
| Document Classification | HOC | BioLinkBERT (large) | Micro F1 | 84.87 | #2 |
| Named Entity Recognition (NER) | JNLPBA | BioLinkBERT (large) | F1 | 80.06 | #6 |
| Question Answering | MedQA | BioLinkBERT (base) | Accuracy | 40.0 | #16 |
| Question Answering | MRQA | LinkBERT (large) | Average F1 | 81.0 | #1 |
| Named Entity Recognition (NER) | NCBI-disease | BioLinkBERT (large) | F1 | 88.76 | #12 |
| Question Answering | NewsQA | LinkBERT (large) | F1 | 72.6 | #2 |
| Question Answering | PubMedQA | BioLinkBERT (large) | Accuracy | 72.2 | #17 |
| Question Answering | PubMedQA | BioLinkBERT (base) | Accuracy | 70.2 | #18 |
| Question Answering | SQuAD1.1 | LinkBERT (large) | EM | 87.45 | #16 |
| Question Answering | SQuAD1.1 | LinkBERT (large) | F1 | 92.7 | #19 |
| Question Answering | TriviaQA | LinkBERT (large) | F1 | 78.2 | #3 |

Methods