Knowledge Distillation

Shrink and Fine-Tune

Introduced by Shleifer et al. in Pre-trained Summarization Distillation

Shrink and Fine-Tune, or SFT, is a type of distillation that avoids an explicit distillation step by copying parameters from a fine-tuned teacher to a smaller student model and then fine-tuning the student. Specifically, it extracts the student from the maximally spaced layers of the fine-tuned teacher: each student layer $l \in L'$ is copied in full from the corresponding teacher layer in $L$. For example, when creating a BART student with 3 decoder layers from a teacher with 12 encoder layers and 12 decoder layers, we copy the teacher's full encoder $Enc^{L}$ and decoder layers 0, 6, and 11 to the student. When deciding which layers to copy, ties are broken arbitrarily; copying layers 0, 5, and 11 might work just as well. When copying only 1 decoder layer, we copy layer 0, which was found to work better than copying layer 11. The impact of initialization on performance is measured experimentally in Section 6.1 of the paper. After initialization, the student model continues to fine-tune on the summarization dataset, with the objective of minimizing $\mathcal{L}_{Data}$.

Source: Pre-trained Summarization Distillation
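
As a rough illustration, the sketch below shows how an SFT student could be initialized from a fine-tuned BART teacher before fine-tuning resumes. It assumes the Hugging Face `transformers` library; the helper names `pick_layers_to_copy` and `shrink_bart`, the exact module attribute paths, and the teacher checkpoint are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of Shrink-and-Fine-Tune (SFT) initialization, assuming the
# Hugging Face `transformers` library. Helper names are illustrative.
import copy
from typing import List

import torch.nn as nn
from transformers import BartForConditionalGeneration


def pick_layers_to_copy(n_student: int, n_teacher: int) -> List[int]:
    """Choose maximally spaced teacher layer indices for the student."""
    if n_student == 1:
        return [0]  # copying layer 0 was found to work better than the last layer
    step = (n_teacher - 1) / (n_student - 1)
    return sorted({round(i * step) for i in range(n_student)})


def shrink_bart(teacher: BartForConditionalGeneration,
                n_dec: int) -> BartForConditionalGeneration:
    """Build a student that keeps the full encoder and n_dec spaced decoder layers."""
    cfg = copy.deepcopy(teacher.config)
    cfg.decoder_layers = n_dec
    student = BartForConditionalGeneration(cfg)

    # Reuse the shared embeddings and the entire encoder from the teacher.
    student.model.shared.load_state_dict(teacher.model.shared.state_dict())
    student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

    # Copy maximally spaced decoder layers, e.g. 0, 6, 11 when shrinking 12 -> 3.
    # (For brevity, decoder positional embeddings and layer norms are left at
    # their fresh initialization here; a faithful reimplementation would copy them too.)
    keep = pick_layers_to_copy(n_dec, teacher.config.decoder_layers)
    student.model.decoder.layers = nn.ModuleList(
        [copy.deepcopy(teacher.model.decoder.layers[i]) for i in keep]
    )
    return student


# Usage: shrink a fine-tuned teacher, then fine-tune the student on the
# summarization data with the ordinary cross-entropy objective (L_Data).
# teacher = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
# student = shrink_bart(teacher, n_dec=3)
```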

Tasks


| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 16 | 15.69% |
| Large Language Model | 11 | 10.78% |
| Instruction Following | 8 | 7.84% |
| Question Answering | 7 | 6.86% |
| Translation | 5 | 4.90% |
| GSM8K | 4 | 3.92% |
| Reinforcement Learning (RL) | 4 | 3.92% |
| Retrieval | 4 | 3.92% |
| Machine Translation | 3 | 2.94% |
