TinyBERT: Distilling BERT for Natural Language Understanding

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the wealth of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
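The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a layer-wise Transformer distillation loss could look like. It assumes a uniform mapping from student layers to teacher layers, a mean-squared-error term on attention matrices, and a soft cross-entropy term on output logits. All function names (`layer_map`, `mse`, `soft_cross_entropy`, `distillation_loss`) are hypothetical, and terms that need a learned projection, such as hidden-state matching between models of different width, are left out.

```python
import math

def layer_map(m, student_layers, teacher_layers):
    """Assumed uniform mapping: 1-based student layer m -> 1-based teacher layer."""
    return m * teacher_layers // student_layers

def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def distillation_loss(student_attn, teacher_attn, student_logits, teacher_logits):
    """Sum of attention-matching terms over mapped layers plus a prediction term.

    student_attn / teacher_attn: one attention matrix (2-D list) per layer.
    """
    n_student, n_teacher = len(student_attn), len(teacher_attn)
    attn_loss = sum(
        mse(student_attn[m - 1],
            teacher_attn[layer_map(m, n_student, n_teacher) - 1])
        for m in range(1, n_student + 1)
    )
    pred_loss = soft_cross_entropy(student_logits, teacher_logits)
    return attn_loss + pred_loss

if __name__ == "__main__":
    # Toy example: 12-layer teacher, 4-layer student, 2x2 attention matrices.
    uniform = [[0.5, 0.5], [0.5, 0.5]]
    teacher_attn = [uniform for _ in range(12)]
    student_attn = [[[0.6, 0.4], [0.5, 0.5]] for _ in range(4)]
    print(distillation_loss(student_attn, teacher_attn, [1.0, 0.0], [2.0, 0.0]))
```

Under this assumed mapping, a 4-layer student paired with a 12-layer teacher would learn from teacher layers 3, 6, 9 and 12. Attention matrices are convenient for a sketch because their shape depends only on sequence length, so student and teacher can be compared directly even when their hidden sizes differ.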

Findings of 2020
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | TinyBERT | Accuracy | 43.3% | #36 |
| Linguistic Acceptability | CoLA Dev | TinyBERT (M=6; d'=768; d'_i=3072) | Accuracy | 54 | #4 |
| Semantic Textual Similarity | MRPC | TinyBERT | Accuracy | 86.4% | #31 |
| Semantic Textual Similarity | MRPC Dev | TinyBERT (M=6; d'=768; d'_i=3072) | Accuracy | 86.3 | #2 |
| Natural Language Inference | MultiNLI | TinyBERT | Matched | 82.5 | #32 |
| Natural Language Inference | MultiNLI | TinyBERT | Mismatched | 81.8 | #23 |
| Natural Language Inference | MultiNLI Dev | TinyBERT (M=6; d'=768; d'_i=3072) | Matched | 84.5 | #1 |
| Natural Language Inference | MultiNLI Dev | TinyBERT (M=6; d'=768; d'_i=3072) | Mismatched | 84.5 | #1 |
| Natural Language Inference | QNLI | TinyBERT | Accuracy | 87.7% | #38 |
| Paraphrase Identification | Quora Question Pairs | TinyBERT | F1 | 71.3 | #14 |
| Natural Language Inference | RTE | TinyBERT | Accuracy | 62.9% | #66 |
| Question Answering | SQuAD1.1 dev | TinyBERT (M=6; d'=768; d'_i=3072) | EM | 79.7 | #15 |
| Question Answering | SQuAD1.1 dev | TinyBERT (M=6; d'=768; d'_i=3072) | F1 | 87.5 | #16 |
| Question Answering | SQuAD2.0 dev | TinyBERT (M=6; d'=768; d'_i=3072) | F1 | 73.4 | #13 |
| Question Answering | SQuAD2.0 dev | TinyBERT (M=6; d'=768; d'_i=3072) | EM | 69.9 | #12 |
| Sentiment Analysis | SST-2 Binary classification | TinyBERT | Accuracy | 92.6 | #45 |
| Semantic Textual Similarity | STS Benchmark | TinyBERT | Pearson Correlation | 0.799 | #28 |

Methods