Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at

Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Named Entity Recognition (NER) | BC2GM | PubMedBERT uncased | F1 | 84.52 | #10 |
| Question Answering | BioASQ | PubMedBERT uncased | Accuracy | 87.56 | #5 |
| Sentence Similarity | BIOSSES | PubMedBERT uncased | Pearson Correlation | 92.3 | #3 |
| Text Classification | BLURB | PubMedBERT (uncased; abstracts) | F1 | 82.32 | #3 |
| Question Answering | BLURB | PubMedBERT (uncased; abstracts) | Accuracy | 71.7 | #3 |
| Relation Extraction | ChemProt | PubMedBERT uncased | Micro F1 | 77.24 | #2 |
| Relation Extraction | DDI | PubMedBERT uncased | Micro F1 | 82.36 | #2 |
| Drug–drug Interaction Extraction | DDI extraction 2013 corpus | PubMedBERT | F1 | 0.8236 | #3 |
| Drug–drug Interaction Extraction | DDI extraction 2013 corpus | PubMedBERT | Micro F1 | 82.36 | #3 |
| Participant Intervention Comparison Outcome Extraction | EBM-NLP | PubMedBERT uncased | F1 | 73.38 | #2 |
| PICO | EBM PICO | PubMedBERT uncased | Macro F1 word level | 73.38 | #3 |
| Relation Extraction | GAD | PubMedBERT uncased | Micro F1 | 82.34 | #2 |
| Document Classification | HOC | PubMedBERT uncased | Micro F1 | 82.32 | #3 |
| Named Entity Recognition (NER) | JNLPBA | PubMedBERT uncased | F1 | 79.1 | #9 |
| Named Entity Recognition (NER) | NCBI-disease | PubMedBERT uncased | F1 | 87.82 | #15 |
| Question Answering | PubMedQA | PubMedBERT uncased | Accuracy | 55.84 | #20 |
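
The results above mix micro-averaged and macro-averaged F1. As a reminder of how the two can diverge, here is a minimal sketch on made-up labels. The toy data and the helper names `f1` and `micro_macro_f1` are illustrative assumptions, not taken from the paper, and the benchmarks themselves may count errors over different units (entities, relations, words) than this single-label example does.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred):
    """Micro- and macro-averaged F1 for single-label classification (hypothetical helper)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Toy, imbalanced example: the rare label "B" is wrong half the time.
gold = ["A", "A", "A", "A", "B", "B"]
pred = ["A", "A", "A", "A", "A", "B"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.833 macro=0.778
```

Because macro averaging weights every label equally, the error on the rare label pulls the macro score down more than the micro score, which is why the two numbers are not directly comparable across rows.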