How to Train BERT with an Academic Budget
While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
EMNLP 2021
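The abstract describes the recipe only at a high level. As a rough illustration of what masked-language-model pretraining looks like in code, the sketch below uses Hugging Face Transformers; the corpus, batch size, step count, and other hyperparameters are placeholders for demonstration, not the authors' actual 24-hour recipe.

```python
# Minimal sketch of masked-language-model (MLM) pretraining with Hugging Face
# Transformers. Corpus, batch size, and step count are placeholders, not the
# paper's 24-hour recipe.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Any plain-text corpus works; wikitext-2 is a small stand-in for real data.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

model = BertForMaskedLM(BertConfig())  # BERT-base-sized model, random init

# The collator masks 15% of tokens, re-sampling the mask for every batch
# (dynamic masking).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mlm-from-scratch",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,  # emulate a larger batch on limited GPUs
    learning_rate=1e-4,
    max_steps=10_000,               # placeholder; set to fit your time budget
    fp16=True,                      # mixed precision; requires a GPU
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()
```

The paper's actual speedups come from the combination of software optimizations, architecture and data choices, and aggressive hyperparameter tuning it describes; the sketch above only shows the generic MLM training loop those choices plug into.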
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | 24hBERT | Accuracy | 57.1 | # 33 |
| Semantic Textual Similarity | MRPC | 24hBERT | Accuracy | 87.5 | # 25 |
| Natural Language Inference | MultiNLI | 24hBERT | Matched Accuracy | 84.4 | # 30 |
| Natural Language Inference | MultiNLI | 24hBERT | Mismatched Accuracy | 83.8 | # 21 |
| Natural Language Inference | QNLI | 24hBERT | Accuracy | 90.6 | # 32 |
| Question Answering | Quora Question Pairs | 24hBERT | Accuracy | 70.7 | # 19 |
| Natural Language Inference | RTE | 24hBERT | Accuracy | 57.7 | # 79 |
| Sentiment Analysis | SST-2 Binary classification | 24hBERT | Accuracy | 93.0 | # 44 |
| Semantic Textual Similarity | STS Benchmark | 24hBERT | Pearson Correlation | 0.820 | # 27 |
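GLUE numbers like those above are obtained by fine-tuning the pretrained checkpoint separately on each task. The sketch below shows a generic fine-tuning setup for one task (MRPC) with Hugging Face Transformers; `bert-base-uncased` stands in for the 24-hour checkpoint, and the hyperparameters are assumptions, not the paper's exact evaluation protocol.

```python
# Minimal sketch of fine-tuning a BERT-style checkpoint on one GLUE task (MRPC).
# "bert-base-uncased" is a stand-in for your own pretrained checkpoint; the
# hyperparameters are placeholders.
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # replace with the path to your pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

raw = load_dataset("glue", "mrpc")

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch examples directly.
    return tokenizer(
        batch["sentence1"],
        batch["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

data = raw.map(tokenize, batched=True)
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)  # accuracy, F1

args = TrainingArguments(
    output_dir="mrpc-finetune",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # dev-set accuracy, comparable to the table above
```

Repeating this procedure per task (with task-appropriate metrics, e.g. Matthews correlation for CoLA or Pearson correlation for STS-B) yields a full GLUE score for the pretrained model.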