Transformers

DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and running 60% faster. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses.

Source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
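
The triple loss described above can be sketched in a few lines. This is a minimal illustration for a single masked token position, not the authors' training code: the function names, the temperature of 2.0, and the equal default weights are all assumptions chosen for readability, and the distillation term is written as a plain cross-entropy against the teacher's softened distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

def mlm_loss(student_logits, target_index):
    """Masked language modeling loss: negative log-likelihood of the true token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    """Cosine distance (1 - cosine similarity) between hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, temperature=2.0,
                w_distill=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms; the weights are placeholder values."""
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Example: a 3-token vocabulary and 2-dimensional hidden states (toy values).
loss = triple_loss(
    student_logits=[1.5, 0.5, 0.2],
    teacher_logits=[2.0, 1.0, 0.1],
    target_index=0,
    student_hidden=[0.9, 0.1],
    teacher_hidden=[1.0, 0.0],
)
print(f"triple loss: {loss:.4f}")
```

In a real training setup each term would be averaged over all masked positions in a batch, and the relative weights would be tuned rather than left at 1.0.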

Tasks


Task                             Papers   Share
Sentiment Analysis               19       7.63%
Classification                   18       7.23%
Text Classification              18       7.23%
Language Modelling               17       6.83%
Question Answering               12       4.82%
Sentence                         9        3.61%
Quantization                     7        2.81%
Natural Language Understanding   6        2.41%
Model Compression                6        2.41%

Components


Component   Type
BERT        Language Models
