Knowledge Distillation
1271 papers with code • 5 benchmarks • 4 datasets
Knowledge distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized.
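The standard recipe (Hinton et al., 2015) softens both models' logits with a temperature T and trains the student to match the teacher's soft targets. A minimal framework-free sketch in pure Python, with the usual T² scaling of the KL term:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T gives softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student predictions.

    Multiplied by T^2 so gradient magnitudes stay comparable as the
    temperature changes (the convention from Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl

# A confident teacher vs. a still-uncertain student.
teacher = [8.0, 2.0, -1.0]
student = [1.5, 1.0, 0.5]
loss = distillation_loss(teacher, student)
```

In practice this term is combined with the ordinary cross-entropy loss on the ground-truth labels, weighted by a mixing coefficient.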
Most implemented papers
Sequence-Level Knowledge Distillation
We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model).
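At sequence level, the idea from Kim & Rush (2016) is to replace the gold target side of the training data with the teacher's own beam-search output and train the student on those pairs with ordinary cross-entropy. A hedged sketch, where `teacher_translate` is a hypothetical stand-in for running beam search over a trained teacher:

```python
def build_distilled_corpus(source_sentences, teacher_translate):
    """Sequence-level knowledge distillation: pair each source sentence
    with the teacher's highest-scoring translation instead of the
    reference, then train the student on these pairs as usual.

    `teacher_translate` is a hypothetical callable mapping a source
    sentence to the teacher's decoded output.
    """
    return [(src, teacher_translate(src)) for src in source_sentences]

# Toy stand-in for a beam-search decoder over a trained teacher model.
toy_teacher = {"guten morgen": "good morning", "danke": "thank you"}.get

corpus = build_distilled_corpus(["guten morgen", "danke"], toy_teacher)
```

Because the student learns to imitate the teacher's approximate mode directly, greedy decoding of the trained student often comes close to its beam-search quality, consistent with the observation quoted above.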
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
Attention plays a critical role in human visual experience.
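The attention-transfer idea (Zagoruyko & Komodakis, 2017) collapses a convolutional activation tensor into a spatial attention map by summing squared values over channels, then penalises the distance between the student's and teacher's normalised maps. A pure-Python sketch on nested lists standing in for C x H x W tensors:

```python
import math

def attention_map(feature_map):
    """Collapse a C x H x W activation into an H*W attention vector by
    summing squared values over channels, then L2-normalising
    (the activation-based attention of Zagoruyko & Komodakis)."""
    c = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    flat = [sum(feature_map[ch][i][j] ** 2 for ch in range(c))
            for i in range(h) for j in range(w)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def attention_transfer_loss(student_feats, teacher_feats):
    """Squared L2 distance between the normalised student and teacher
    attention maps; added to the task loss during student training."""
    qs = attention_map(student_feats)
    qt = attention_map(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(qs, qt))
```

The maps only require matching spatial size, not matching channel counts, which is what lets a thin student mimic where a wide teacher "looks".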
FedMD: Heterogenous Federated Learning via Model Distillation
With 10 distinct participants, the final test accuracy of each model on average receives a 20% gain on top of what's possible without collaboration and is only a few percent lower than the performance each model would have obtained if all private datasets were pooled and made directly available for all participants.
ProSelfLC: Progressive Self Label Correction for Training Robust Deep Neural Networks
Keywords: entropy minimisation, maximum entropy, confidence penalty, self knowledge distillation, label correction, label noise, semi-supervised learning, output regularisation
TernaryBERT: Distillation-aware Ultra-low Bit BERT
Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices.
Anomaly Detection via Reverse Distillation from One-Class Embedding
Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD). The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD.
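The T-S discrepancy can be turned into a per-location anomaly score: where the student fails to reproduce the teacher's embedding, the region is flagged as anomalous. A simplified sketch, assuming teacher and student features are lists of per-position vectors (the actual paper aggregates multi-scale cosine distances):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly score: 1 - cosine similarity between teacher
    and student features at each spatial position. On normal regions the
    student matches the teacher well (score near 0); anomalies, never
    seen during one-class training, produce a large discrepancy."""
    return [1.0 - cosine(t, s) for t, s in zip(teacher_feats, student_feats)]
```

Thresholding or taking the maximum of this map yields a pixel-level or image-level anomaly decision.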
Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations
Deployment of deep learning models in robotics as sensory information extractors can be a daunting task to handle, even using generic GPU cards.
Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells
While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks.
Network Pruning via Transformable Architecture Search
The size with maximum probability in each learned distribution serves as the width and depth of the pruned network, whose parameters are then learned via knowledge transfer, e.g. knowledge distillation, from the original networks.
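Reading the pruned architecture off the learned distributions amounts to an argmax per layer; a hypothetical sketch of that selection step (candidate sizes and probabilities are illustrative, not from the paper):

```python
def select_size(candidates, probabilities):
    """Pick the candidate size (e.g. a channel width) with the highest
    learned probability, one selection per searchable dimension."""
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best]

# One learned distribution over widths per layer; take the mode of each.
layer_width_probs = [[0.1, 0.7, 0.2], [0.2, 0.3, 0.5]]
widths = [select_size([16, 32, 64], p) for p in layer_width_probs]
```

The selected widths and depth define the pruned network, which is then trained with distillation from the unpruned original rather than from scratch.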
On the Effect of Dropping Layers of Pre-trained Transformer Models
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments.