
# Shrink and Fine-Tune

Introduced by Shleifer et al. in Pre-trained Summarization Distillation

Shrink and Fine-Tune, or SFT, is a type of distillation that avoids explicit distillation by copying parameters to a student model and then fine-tuning. Specifically, it extracts a student model from the maximally spaced layers of a fine-tuned teacher. Each student layer $l \in L'$ is copied fully from the teacher's layers $L$. For example, when creating a BART student with 3 decoder layers from a teacher with 12 encoder layers and 12 decoder layers, we copy the teacher's full encoder $Enc^{L}$ and decoder layers 0, 6, and 11 to the student. When deciding which layers to copy, we break ties arbitrarily; copying layers 0, 5, and 11 might work just as well. When copying only 1 decoder layer, we copy layer 0, which was found to work better than copying layer 11. The impact of initialization on performance is measured experimentally in Section 6.1 of the paper. After initialization, the student model continues to fine-tune on the summarization dataset, with the objective of minimizing $\mathcal{L}_{Data}$.
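The layer-selection step can be illustrated with a minimal sketch. The function names `pick_layers` and `shrink_layers` are hypothetical and are not taken from the paper's code. Layers are modeled as plain Python objects rather than real BART modules, and ties are rounded upward, which is only one of the arbitrary tie-breaks the description allows.

```python
import copy


def pick_layers(n_teacher: int, n_student: int) -> list[int]:
    """Return indices of maximally spaced teacher layers (hypothetical helper).

    Assumption: ties are rounded up, so 12 -> 3 gives [0, 6, 11];
    the description notes that [0, 5, 11] might work just as well.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("n_student must be between 1 and n_teacher")
    if n_student == 1:
        # With a single student layer, copy layer 0 (reported to work
        # better than copying the last layer).
        return [0]
    span = n_teacher - 1
    gaps = n_student - 1
    # Integer arithmetic for round-half-up of i * span / gaps.
    return [(2 * i * span + gaps) // (2 * gaps) for i in range(n_student)]


def shrink_layers(teacher_layers: list, n_student: int) -> list:
    """Copy the selected teacher layers in full to build the student stack."""
    indices = pick_layers(len(teacher_layers), n_student)
    # Deep copies, so fine-tuning the student leaves the teacher untouched.
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


# Stand-in for a 12-layer teacher decoder: each "layer" is just a dict here.
teacher_decoder = [{"layer_id": i} for i in range(12)]
student_decoder = shrink_layers(teacher_decoder, 3)
print([layer["layer_id"] for layer in student_decoder])
```

In a real setup the copied layers would be the teacher's decoder modules, the encoder would be copied whole, and the resulting student would then be fine-tuned on the summarization data as described above.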
