Knowledge Distillation

1271 papers with code • 5 benchmarks • 4 datasets

Knowledge distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized, yet evaluating the large model stays just as computationally expensive either way. Distillation trains a small model to reproduce the large model's behaviour, so comparable accuracy can be obtained at a much lower deployment cost.
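
In its most common formulation (soft-target distillation, following Hinton et al.), the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label loss. A minimal PyTorch sketch, with the temperature `T` and mixing weight `alpha` chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target distillation: blend the hard-label cross-entropy with a
    KL term pulling the student toward the teacher's softened distribution."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # T**2 keeps the gradient scale of the soft term comparable to the hard term.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In practice the teacher's logits are computed under `torch.no_grad()` so that only the student receives gradients.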

Most implemented papers

Sequence-Level Knowledge Distillation

harvardnlp/seq2seq-attn EMNLP 2016

We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and we also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied to the original teacher model).
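
For reference, the word-level variant applies a per-token distillation loss at every target position, while the sequence-level variants instead train the student with ordinary cross-entropy on the teacher's beam-searched outputs. A hedged PyTorch sketch of the word-level term, assuming both models produce per-token logits of shape (batch, seq_len, vocab) and `pad_id` marks padding:

```python
import torch
import torch.nn.functional as F

def word_level_kd(student_logits, teacher_logits, target_ids, pad_id, T=1.0):
    """Word-level KD for seq2seq: at each target position, push the student's
    token distribution toward the teacher's. Logits: (batch, seq_len, vocab)."""
    mask = (target_ids != pad_id).float()                    # ignore padding
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Token-wise KL(teacher || student), summed over the vocabulary.
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(-1)
    return (kl * mask).sum() / mask.sum()
```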

FedMD: Heterogenous Federated Learning via Model Distillation

KarhouTam/FL-bench 8 Oct 2019

With 10 distinct participants, each model's final test accuracy improves by 20% on average over what is achievable without collaboration, and is only a few percentage points lower than the accuracy it would have reached if all private datasets had been pooled and made directly available to every participant.

ProSelfLC: Progressive Self Label Correction for Training Robust Deep Neural Networks

XinshaoAmosWang/ProSelfLC-CVPR2021 CVPR 2021

Keywords: entropy minimisation, maximum entropy, confidence penalty, self knowledge distillation, label correction, label noise, semi-supervised learning, output regularisation

TernaryBERT: Distillation-aware Ultra-low Bit BERT

huawei-noah/Pretrained-Language-Model EMNLP 2020

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are expensive in both computation and memory, hindering their deployment on resource-constrained devices.

Anomaly Detection via Reverse Distillation from One-Class Embedding

hq-deng/RD4AD CVPR 2022

Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD). The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD.
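
In this line of work the anomaly score is typically the feature discrepancy between a frozen teacher and a student trained only on anomaly-free data (reverse distillation feeds the student a compact one-class embedding of the teacher's features rather than the raw image). A rough sketch of a discrepancy-based anomaly map, assuming both networks expose comparable multi-scale feature maps:

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats, out_size):
    """Per-pixel anomaly score from T-S feature discrepancy: 1 - cosine
    similarity at each location, accumulated over feature scales."""
    score = torch.zeros(teacher_feats[0].shape[0], 1, *out_size,
                        device=teacher_feats[0].device)
    for t, s in zip(teacher_feats, student_feats):
        d = 1.0 - F.cosine_similarity(t, s, dim=1, eps=1e-6)  # (B, H, W)
        d = F.interpolate(d.unsqueeze(1), size=out_size,
                          mode="bilinear", align_corners=False)
        score = score + d          # higher = more anomalous
    return score
```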

Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations

DrSleep/multi-task-refinenet 13 Sep 2018

Deploying deep learning models as sensory information extractors in robotics can be a daunting task, even with generic GPU cards.

Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells

drsleep/nas-segm-pytorch CVPR 2019

While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks.

Network Pruning via Transformable Architecture Search

D-X-Y/NAS-Projects NeurIPS 2019

The size with the maximum probability in each distribution serves as the width and depth of the pruned network, whose parameters are learned by knowledge transfer, e.g., knowledge distillation, from the original networks.
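
Put differently, each searchable layer carries a learnable categorical distribution over candidate sizes; after the search, the most probable width and depth are kept and the resulting slim network is trained with a distillation loss against the unpruned original. A toy sketch of the selection step, where `width_logits`, `depth_logits`, and the candidate lists are made-up placeholders for the search state:

```python
import torch

# Made-up search state: one categorical distribution per searchable layer over
# candidate channel counts, plus one over candidate depths.
width_candidates = [32, 64, 96, 128]
width_logits = [torch.randn(len(width_candidates), requires_grad=True)
                for _ in range(5)]
depth_candidates = [3, 4, 5]
depth_logits = torch.randn(len(depth_candidates), requires_grad=True)

# Keep the most probable size in each distribution ...
widths = [width_candidates[w.softmax(0).argmax().item()] for w in width_logits]
depth = depth_candidates[depth_logits.softmax(0).argmax().item()]

# ... then the pruned network built from (widths, depth) is trained with a
# distillation loss against the original, unpruned network.
print(widths, depth)
```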

On the Effect of Dropping Layers of Pre-trained Transformer Models

hsajjad/transformers 8 Apr 2020

Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments.