How to Train BERT with an Academic Budget

EMNLP 2021 · Peter Izsak, Moshe Berchansky, Omer Levy

While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
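For readers unfamiliar with the pretraining objective the abstract mentions, the following is a minimal sketch of how one masked-language-model training example might be constructed. It is not taken from the paper: the function name `mask_tokens`, the 15% selection rate, and the 80/10/10 replacement split are assumptions that follow the original BERT setup, and the toy whitespace tokenizer stands in for a real subword vocabulary.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumed, as in the original BERT vocabulary)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would only be computed where labels[i]
    is not None.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK  # 80% of selected positions: mask symbol
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: leave the original token in place
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    corrupted, labels = mask_tokens(sentence, vocab=sorted(set(sentence)), mask_prob=0.5)
    print(corrupted)
    print(labels)
```

The model is trained to recover the tokens stored in `labels` from the corrupted sequence. Unselected positions are left untouched and contribute nothing to the loss.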

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | 24hBERT | Accuracy | 57.1 | #28 |
| Semantic Textual Similarity | MRPC | 24hBERT | Accuracy | 87.5 | #22 |
| Natural Language Inference | MultiNLI | 24hBERT | Matched | 84.4 | #25 |
| Natural Language Inference | MultiNLI | 24hBERT | Mismatched | 83.8 | #19 |
| Natural Language Inference | QNLI | 24hBERT | Accuracy | 90.6 | #31 |
| Question Answering | Quora Question Pairs | 24hBERT | Accuracy | 70.7 | #19 |
| Natural Language Inference | RTE | 24hBERT | Accuracy | 57.7 | #50 |
| Sentiment Analysis | SST-2 Binary classification | 24hBERT | Accuracy | 93.0 | #39 |
| Semantic Textual Similarity | STS Benchmark | 24hBERT | Pearson Correlation | 0.820 | #24 |