Knowledge Distillation

Shrink and Fine-Tune

Introduced by Shleifer et al. in Pre-trained Summarization Distillation

Shrink and Fine-Tune, or SFT, is a type of distillation that avoids an explicit distillation step by copying parameters from a fine-tuned teacher to a smaller student model and then fine-tuning the student. Specifically, it extracts the student from the maximally spaced layers of the fine-tuned teacher: each student layer $l \in L'$ is copied in full from the corresponding teacher layer in $L$. For example, when creating a BART student with 3 decoder layers from a teacher with 12 encoder layers and 12 decoder layers, we copy the teacher's full encoder $Enc^{L}$ and decoder layers 0, 6, and 11 to the student. When deciding which layers to copy, ties are broken arbitrarily; copying layers 0, 5, and 11 might work just as well. When copying only 1 decoder layer, we copy layer 0, which was found to work better than copying layer 11. The impact of initialization on performance is measured experimentally in Section 6.1 of the paper. After initialization, the student model continues to fine-tune on the summarization dataset, with the objective of minimizing $\mathcal{L}_{Data}$.

Source: Pre-trained Summarization Distillation
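
As a rough illustration, the sketch below shows how an SFT student could be initialized from a fine-tuned BART teacher before fine-tuning resumes. It assumes the Hugging Face `transformers` library; the helper names `pick_layers_to_copy` and `shrink_bart`, the exact module attribute paths, and the teacher checkpoint are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of Shrink-and-Fine-Tune (SFT) initialization, assuming the
# Hugging Face `transformers` library. Helper names are illustrative.
import copy
from typing import List

import torch.nn as nn
from transformers import BartForConditionalGeneration


def pick_layers_to_copy(n_student: int, n_teacher: int) -> List[int]:
    """Choose maximally spaced teacher layer indices for the student."""
    if n_student == 1:
        return [0]  # copying layer 0 was found to work better than the last layer
    step = (n_teacher - 1) / (n_student - 1)
    return sorted({round(i * step) for i in range(n_student)})


def shrink_bart(teacher: BartForConditionalGeneration,
                n_dec: int) -> BartForConditionalGeneration:
    """Build a student that keeps the full encoder and n_dec spaced decoder layers."""
    cfg = copy.deepcopy(teacher.config)
    cfg.decoder_layers = n_dec
    student = BartForConditionalGeneration(cfg)

    # Reuse the shared embeddings and the entire encoder from the teacher.
    student.model.shared.load_state_dict(teacher.model.shared.state_dict())
    student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

    # Copy maximally spaced decoder layers, e.g. 0, 6, 11 when shrinking 12 -> 3.
    # (For brevity, decoder positional embeddings and layer norms are left at
    # their fresh initialization here; a faithful reimplementation would copy them too.)
    keep = pick_layers_to_copy(n_dec, teacher.config.decoder_layers)
    student.model.decoder.layers = nn.ModuleList(
        [copy.deepcopy(teacher.model.decoder.layers[i]) for i in keep]
    )
    return student


# Usage: shrink a fine-tuned teacher, then fine-tune the student on the
# summarization data with the ordinary cross-entropy objective (L_Data).
# teacher = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
# student = shrink_bart(teacher, n_dec=3)
```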

Tasks


| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 16 | 15.69% |
| Large Language Model | 11 | 10.78% |
| Instruction Following | 8 | 7.84% |
| Question Answering | 7 | 6.86% |
| Translation | 5 | 4.90% |
| GSM8K | 4 | 3.92% |
| Reinforcement Learning (RL) | 4 | 3.92% |
| Retrieval | 4 | 3.92% |
| Machine Translation | 3 | 2.94% |
