Knowledge Distillation

439 papers with code • 1 benchmark • 1 dataset

Knowledge distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized.
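
In practice, distillation is commonly implemented by training the student on a weighted mix of the usual hard-label loss and a loss that matches the teacher's temperature-softened output distribution. Below is a minimal PyTorch-style sketch of that objective; the temperature T and the weight alpha are illustrative hyperparameters, not values prescribed by any particular paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a temperature-softened soft-target term."""
    # Standard supervised loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened teacher and student distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```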

Most implemented papers

Distilling the Knowledge in a Neural Network

labmlai/annotated_deep_learning_paper_implementations 9 Mar 2015

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions.
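
As a rough illustration of using such an ensemble as a teacher, the averaged (optionally temperature-softened) class probabilities of several trained models can serve as the soft targets for distillation. The sketch below assumes a list of trained PyTorch models and is not taken from the linked implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_soft_targets(models, inputs, T=4.0):
    # Average the temperature-softened class probabilities of every ensemble member.
    probs = [F.softmax(model(inputs) / T, dim=-1) for model in models]
    return torch.stack(probs).mean(dim=0)
```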

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

google-research/bert ICLR 2020

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

huggingface/transformers NeurIPS 2019

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
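
A minimal usage sketch with the Hugging Face transformers library, loading the publicly released distilbert-base-uncased checkpoint (DistilBERT keeps BERT's hidden size of 768 while using fewer layers); the example sentence is arbitrary.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Knowledge distillation makes models smaller.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```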

Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

adityac94/Grad_CAM_plus_plus 30 Oct 2017

Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

PaddlePaddle/PaddleSpeech ICLR 2021

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
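
As a loose sketch of the variance-conditioning idea (not the PaddleSpeech implementation), a variance adaptor can predict per-frame quantities such as pitch and energy, embed them, and add them back into the hidden sequence, conditioning on ground-truth values during training and on its own predictions at inference. Module names, dimensions, and value ranges below are illustrative.

```python
import torch
import torch.nn as nn

def quantize(values, n_bins=256, lo=-4.0, hi=4.0):
    # Bucketize continuous pitch/energy values into embedding indices (illustrative range).
    boundaries = torch.linspace(lo, hi, n_bins - 1)
    return torch.bucketize(values, boundaries)

class VariancePredictor(nn.Module):
    """Tiny stand-in predictor: hidden states -> one scalar (e.g. pitch) per frame."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

class VarianceAdaptor(nn.Module):
    """Predict pitch/energy, embed them, and add them back into the hidden sequence."""
    def __init__(self, hidden=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden)
        self.energy_predictor = VariancePredictor(hidden)
        self.pitch_embed = nn.Embedding(n_bins, hidden)
        self.energy_embed = nn.Embedding(n_bins, hidden)

    def forward(self, x, pitch_target=None, energy_target=None):
        pitch_pred = self.pitch_predictor(x)
        energy_pred = self.energy_predictor(x)
        # Condition on ground-truth variance information during training,
        # and on the model's own predictions at inference time.
        pitch = pitch_target if pitch_target is not None else pitch_pred
        energy = energy_target if energy_target is not None else energy_pred
        x = x + self.pitch_embed(quantize(pitch)) + self.energy_embed(quantize(energy))
        return x, pitch_pred, energy_pred
```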

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

UKPLab/sentence-transformers EMNLP 2020

The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence.
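
A hedged sketch of this objective: the multilingual student encoder is trained with mean-squared error so that both the original sentence and its translation are mapped onto the (fixed) teacher embedding of the original sentence. Tensor shapes and names are illustrative.

```python
import torch.nn.functional as F

def multilingual_distill_loss(teacher_src_emb, student_src_emb, student_tgt_emb):
    # The teacher embedding of the source sentence is the shared target
    # for the student's embeddings of both the source and its translation.
    return (F.mse_loss(student_src_emb, teacher_src_emb)
            + F.mse_loss(student_tgt_emb, teacher_src_emb))
```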

Sequence-Level Knowledge Distillation

harvardnlp/seq2seq-attn EMNLP 2016

We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model).
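
The two flavours can be sketched roughly as follows: word-level KD matches the teacher's per-token output distribution, while sequence-level KD decodes the teacher (e.g. with beam search) and uses its output sequences as ordinary cross-entropy targets for the student. The models and the generate() call below are hypothetical stand-ins, not the harvardnlp/seq2seq-attn API.

```python
import torch
import torch.nn.functional as F

def word_level_kd(student_logits, teacher_logits, T=1.0):
    # Match the teacher's output distribution at every target-word position.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )

@torch.no_grad()
def sequence_level_kd_targets(teacher_model, src_batch):
    # Decode the teacher and treat its output sequences as the training
    # targets for ordinary cross-entropy on the student.
    return teacher_model.generate(src_batch)  # hypothetical generate() interface
```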

TinyBERT: Distilling BERT for Natural Language Understanding

huawei-noah/Pretrained-Language-Model Findings of EMNLP 2020

To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models.
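
A rough sketch of the layer-wise part of such Transformer distillation: mean-squared error between the teacher's and student's attention matrices, and between the teacher's hidden states and a learned projection of the student's smaller hidden states. The dimensions and the layer mapping are illustrative, not the huawei-noah implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class LayerDistillLoss(nn.Module):
    """Attention-matrix MSE plus (projected) hidden-state MSE for one teacher/student layer pair."""
    def __init__(self, student_hidden=312, teacher_hidden=768):
        super().__init__()
        # Learnable projection from the student's smaller hidden space to the teacher's.
        self.proj = nn.Linear(student_hidden, teacher_hidden, bias=False)

    def forward(self, s_hidden, t_hidden, s_attn, t_attn):
        hidden_loss = F.mse_loss(self.proj(s_hidden), t_hidden)
        attn_loss = F.mse_loss(s_attn, t_attn)
        return hidden_loss + attn_loss
```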

FitNets: Hints for Thin Deep Nets

adri-romsor/FitNets 19 Dec 2014

In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.
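
A short sketch of the hint-based stage: a small regressor maps the thin student's intermediate feature map to the teacher's hint layer, and the student is pre-trained to minimise the mismatch before the usual distillation step. The 1x1 convolutional regressor here is one illustrative choice, not the exact architecture from the repository.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Match the student's intermediate feature map to the teacher's hint layer."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A 1x1 convolutional regressor reconciles the channel dimensions.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        return F.mse_loss(self.regressor(student_feat), teacher_feat)
```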