DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

NeurIPS 2019 · Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts.
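The core idea is knowledge distillation: a small student model is trained to match the output distribution of a large pre-trained teacher in addition to the usual hard-label objective. Below is a minimal sketch of a temperature-scaled distillation loss in PyTorch. The temperature T, mixing weight alpha, and the toy classification setup are illustrative assumptions, not the paper's exact pre-training recipe.

```python
# A minimal sketch of a knowledge-distillation objective: a soft-target KL term
# on temperature-softened teacher/student distributions plus the usual
# hard-label cross-entropy. T and alpha are assumed values for illustration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term with a hard-label cross-entropy term."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across T
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits for a 2-class task.
student = torch.randn(8, 2, requires_grad=True)
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```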


Evaluation Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Linguistic Acceptability | CoLA | DistilBERT | Accuracy | 49.1% | #13 |
| Semantic Textual Similarity | MRPC | DistilBERT | Accuracy | 90.2% | #5 |
| Natural Language Inference | QNLI | DistilBERT | Accuracy | 90.2% | #13 |
| Question Answering | Quora Question Pairs | DistilBERT | Accuracy | 89.2% | #11 |
| Natural Language Inference | RTE | DistilBERT | Accuracy | 62.9% | #12 |
| Question Answering | SQuAD1.1 dev | DistilBERT | EM | 77.7 | #12 |
| Question Answering | SQuAD1.1 dev | DistilBERT | F1 | 85.8 | #14 |
| Sentiment Analysis | SST-2 Binary classification | DistilBERT | Accuracy | 92.7% | #15 |
| Semantic Textual Similarity | STS Benchmark | DistilBERT | Pearson Correlation | 0.907 | #5 |
| Natural Language Inference | WNLI | DistilBERT | Accuracy | 44.4% | #11 |
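
For context on how numbers like the GLUE accuracies above are typically obtained, the sketch below fine-tunes the publicly released distilbert-base-uncased checkpoint on SST-2 with the Hugging Face transformers and datasets libraries. The checkpoint name, hyperparameters, and metric hookup are assumptions for illustration; they are not the exact configuration behind the reported results.

```python
# A minimal sketch, assuming the `transformers` and `datasets` packages, of
# fine-tuning distilbert-base-uncased on SST-2 and reporting dev accuracy.
# Hyperparameters are placeholders, not the settings used in the paper.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Tokenize the GLUE SST-2 splits.
sst2 = load_dataset("glue", "sst2")
encoded = sst2.map(
    lambda batch: tokenizer(
        batch["sentence"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

def compute_accuracy(eval_pred):
    # Accuracy over the dev set, matching the metric reported in the table.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilbert-sst2",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_accuracy,
)
trainer.train()
print(trainer.evaluate())
```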