Knowledge Distillation
1271 papers with code • 5 benchmarks • 4 datasets
Knowledge distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized.
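The standard recipe (Hinton et al., 2015) softens both models' logits with a temperature T and trains the student to match the teacher's soft targets. A minimal framework-free sketch in pure Python, with the usual T² scaling of the KL term:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T gives softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student predictions.

    Multiplied by T^2 so gradient magnitudes stay comparable as the
    temperature changes (the convention from Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl

# A confident teacher vs. a still-uncertain student.
teacher = [8.0, 2.0, -1.0]
student = [1.5, 1.0, 0.5]
loss = distillation_loss(teacher, student)
```

In practice this term is combined with the ordinary cross-entropy loss on the ground-truth labels, weighted by a mixing coefficient.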
Most implemented papers
Sequence-Level Knowledge Distillation
We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model).
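At sequence level, the idea from Kim & Rush (2016) is to replace the gold target side of the training data with the teacher's own beam-search output and train the student on those pairs with ordinary cross-entropy. A hedged sketch, where `teacher_translate` is a hypothetical stand-in for running beam search over a trained teacher:

```python
def build_distilled_corpus(source_sentences, teacher_translate):
    """Sequence-level knowledge distillation: pair each source sentence
    with the teacher's highest-scoring translation instead of the
    reference, then train the student on these pairs as usual.

    `teacher_translate` is a hypothetical callable mapping a source
    sentence to the teacher's decoded output.
    """
    return [(src, teacher_translate(src)) for src in source_sentences]

# Toy stand-in for a beam-search decoder over a trained teacher model.
toy_teacher = {"guten morgen": "good morning", "danke": "thank you"}.get

corpus = build_distilled_corpus(["guten morgen", "danke"], toy_teacher)
```

Because the student learns to imitate the teacher's approximate mode directly, greedy decoding of the trained student often comes close to its beam-search quality, consistent with the observation quoted above.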
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
Attention plays a critical role in human visual experience.
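The attention-transfer idea (Zagoruyko & Komodakis, 2017) collapses a convolutional activation tensor into a spatial attention map by summing squared values over channels, then penalises the distance between the student's and teacher's normalised maps. A pure-Python sketch on nested lists standing in for C x H x W tensors:

```python
import math

def attention_map(feature_map):
    """Collapse a C x H x W activation into an H*W attention vector by
    summing squared values over channels, then L2-normalising
    (the activation-based attention of Zagoruyko & Komodakis)."""
    c = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    flat = [sum(feature_map[ch][i][j] ** 2 for ch in range(c))
            for i in range(h) for j in range(w)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def attention_transfer_loss(student_feats, teacher_feats):
    """Squared L2 distance between the normalised student and teacher
    attention maps; added to the task loss during student training."""
    qs = attention_map(student_feats)
    qt = attention_map(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(qs, qt))
```

The maps only require matching spatial size, not matching channel counts, which is what lets a thin student mimic where a wide teacher "looks".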
FedMD: Heterogenous Federated Learning via Model Distillation
With 10 distinct participants, the final test accuracy of each model on average receives a 20% gain on top of what's possible without collaboration and is only a few percent lower than the performance each model would have obtained if all private datasets were pooled and made directly available for all participants.
ProSelfLC: Progressive Self Label Correction for Training Robust Deep Neural Networks
Keywords: entropy minimisation, maximum entropy, confidence penalty, self knowledge distillation, label correction, label noise, semi-supervised learning, output regularisation
TernaryBERT: Distillation-aware Ultra-low Bit BERT
Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices.
Anomaly Detection via Reverse Distillation from One-Class Embedding
Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD). The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD.
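The T-S discrepancy can be turned into a per-location anomaly score: where the student fails to reproduce the teacher's embedding, the region is flagged as anomalous. A simplified sketch, assuming teacher and student features are lists of per-position vectors (the actual paper aggregates multi-scale cosine distances):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly score: 1 - cosine similarity between teacher
    and student features at each spatial position. On normal regions the
    student matches the teacher well (score near 0); anomalies, never
    seen during one-class training, produce a large discrepancy."""
    return [1.0 - cosine(t, s) for t, s in zip(teacher_feats, student_feats)]
```

Thresholding or taking the maximum of this map yields a pixel-level or image-level anomaly decision.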
Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations
Deployment of deep learning models in robotics as sensory information extractors can be a daunting task to handle, even using generic GPU cards.
Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells
While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks.
Network Pruning via Transformable Architecture Search
The size with maximum probability in each learned distribution serves as the width and depth of the pruned network, whose parameters are then learned via knowledge transfer, e.g. knowledge distillation, from the original networks.
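Reading the pruned architecture off the learned distributions amounts to an argmax per layer; a hypothetical sketch of that selection step (candidate sizes and probabilities are illustrative, not from the paper):

```python
def select_size(candidates, probabilities):
    """Pick the candidate size (e.g. a channel width) with the highest
    learned probability, one selection per searchable dimension."""
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best]

# One learned distribution over widths per layer; take the mode of each.
layer_width_probs = [[0.1, 0.7, 0.2], [0.2, 0.3, 0.5]]
widths = [select_size([16, 32, 64], p) for p in layer_width_probs]
```

The selected widths and depth define the pruned network, which is then trained with distillation from the unpruned original rather than from scratch.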
On the Effect of Dropping Layers of Pre-trained Transformer Models
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments.