DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses.
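The triple loss can be sketched as follows. This is a minimal illustration in PyTorch, assuming `student_logits`/`teacher_logits` are masked-LM output logits and `student_hidden`/`teacher_hidden` are final hidden states; the temperature and loss weights here are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits,
                student_hidden, teacher_hidden,
                labels, temperature=2.0,
                alpha_distill=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    """Sketch of the combined distillation + MLM + cosine-distance loss.
    Weight/temperature values are assumptions for illustration."""
    t = temperature

    # Distillation loss: KL divergence between temperature-softened
    # student and teacher output distributions.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

    # Masked language modeling loss on the hard labels
    # (positions set to -100 are ignored, as in standard MLM training).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Cosine embedding loss aligning student and teacher hidden states.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return alpha_distill * loss_distill + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```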
Source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
| Task | Papers | Share |
|---|---|---|
| Sentiment Analysis | 12 | 7.50% |
| Language Modelling | 12 | 7.50% |
| Classification | 10 | 6.25% |
| Question Answering | 9 | 5.63% |
| Text Classification | 8 | 5.00% |
| Knowledge Distillation | 7 | 4.38% |
| Quantization | 6 | 3.75% |
| General Classification | 6 | 3.75% |
| Model Compression | 4 | 2.50% |
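Sentiment analysis and text classification are the most common downstream uses in the table above. A minimal usage sketch, assuming the Hugging Face `transformers` library and the `distilbert-base-uncased-finetuned-sst-2-english` checkpoint (neither is specified in the table itself):

```python
from transformers import pipeline

# Load a DistilBERT checkpoint fine-tuned for sentiment analysis.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("DistilBERT keeps most of BERT's accuracy at a fraction of the cost."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```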