DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
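
The triple objective is straightforward to write down. Below is a minimal PyTorch sketch of such a loss, assuming the student and teacher outputs have already been computed; the function name, the temperature, and the loss weights are illustrative assumptions, not necessarily the paper's exact settings.

```python
# Minimal sketch of a triple distillation loss (KL distillation + masked LM +
# cosine alignment of hidden states). Shapes assumed:
#   student_logits, teacher_logits: [batch, seq_len, vocab]
#   student_hidden, teacher_hidden: [batch, seq_len, dim]
#   mlm_labels: [batch, seq_len] with ignore index -100 on unmasked positions.
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits,
                student_hidden, teacher_hidden,
                mlm_labels, T=2.0,
                alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # 1) Distillation loss: soften both distributions with temperature T and
    #    match the student to the teacher with KL divergence.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # 2) Masked language modeling loss on the hard labels.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss aligning the directions of the student's and
    #    teacher's hidden states.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    # Weighted sum of the three terms; the weights here are placeholders.
    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```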

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Linguistic Acceptability | CoLA | DistilBERT | Accuracy | 49.1% | # 21 |
| Sentiment Analysis | IMDb | DistilBERT | Accuracy | 92.82 | # 17 |
| Semantic Textual Similarity | MRPC | DistilBERT | Accuracy | 90.2% | # 9 |
| Natural Language Inference | QNLI | DistilBERT | Accuracy | 90.2% | # 22 |
| Question Answering | Quora Question Pairs | DistilBERT | Accuracy | 89.2% | # 13 |
| Natural Language Inference | RTE | DistilBERT | Accuracy | 62.9% | # 22 |
| Question Answering | SQuAD1.1 dev | DistilBERT | EM | 77.7 | # 16 |
| Question Answering | SQuAD1.1 dev | DistilBERT | F1 | 85.8 | # 18 |
| Sentiment Analysis | SST-2 Binary classification | DistilBERT | Accuracy | 91.3 | # 36 |
| Semantic Textual Similarity | STS Benchmark | DistilBERT | Pearson Correlation | 0.907 | # 11 |
| Natural Language Inference | WNLI | DistilBERT | Accuracy | 44.4% | # 13 |
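
Results of this kind are obtained by fine-tuning the pre-trained DistilBERT checkpoint on each downstream task. As a usage illustration, the sketch below fine-tunes a DistilBERT checkpoint on SST-2 with the Hugging Face transformers and datasets libraries; the checkpoint name, hyperparameters, and preprocessing are assumptions for illustration, not the exact setup behind the numbers above.

```python
# Illustrative fine-tuning of a DistilBERT checkpoint on SST-2.
# Checkpoint name and hyperparameters are assumptions, not the paper's setup.
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# SST-2 (binary sentiment) from the GLUE benchmark as the example task.
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-sst2",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```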
