Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-style models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):
$$ \theta_{t} = \theta_{t-1} - \eta\cdot\nabla_{\theta}J\left(\theta\right)$$
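The update above can be sketched in a few lines of Python. The objective $J(\theta) = \theta^2$ and the step count are hypothetical stand-ins chosen only to make the example runnable:

```python
# Vanilla SGD update: theta_t = theta_{t-1} - eta * dJ/dtheta.
# J(theta) = theta**2 is an illustrative objective, not from the source.

def grad_J(theta):
    # Gradient of J(theta) = theta**2 is 2*theta.
    return 2.0 * theta

def sgd_step(theta, eta):
    # One SGD update with a single global learning rate eta.
    return theta - eta * grad_J(theta)

theta = 1.0
for _ in range(50):
    theta = sgd_step(theta, eta=0.1)

# theta shrinks geometrically toward the minimum at 0.
print(theta)
```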
where $\eta$ is the learning rate and $\nabla_{\theta}J\left(\theta\right)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta^{1}, \ldots, \theta^{L}\}$, where $\theta^{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta^{1}, \ldots, \eta^{L}\}$, where $\eta^{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:
$$ \theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l}\cdot\nabla_{\theta^{l}}J\left(\theta\right) $$
The authors find that empirically it works well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then use $\eta^{l-1}=\eta^{l}/2.6$ as the learning rate for the lower layers.
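This per-layer schedule can be sketched as follows. The top-layer rate and layer count here are illustrative values, not taken from the paper; only the $\eta^{l-1}=\eta^{l}/2.6$ rule comes from the source:

```python
# Discriminative fine-tuning schedule: choose eta_L for the top layer,
# then set eta_{l-1} = eta_l / 2.6 for each lower layer (ULMFiT heuristic).

def discriminative_lrs(top_lr, num_layers, decay=2.6):
    # Returns [eta_1, ..., eta_L], lowest layer first.
    lrs = [top_lr]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / decay)
    return list(reversed(lrs))

# Hypothetical setup: 4 layers, top-layer learning rate 0.01.
lrs = discriminative_lrs(top_lr=0.01, num_layers=4)
print(lrs)
```

In a framework such as PyTorch, these rates would typically be applied by passing one parameter group per layer to the optimizer, each with its own `lr`.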
Source: Universal Language Model Fine-tuning for Text Classification

| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 79 | 10.55% |
| Large Language Model | 38 | 5.07% |
| Text Generation | 34 | 4.54% |
| Retrieval | 22 | 2.94% |
| Question Answering | 20 | 2.67% |
| Sentence | 17 | 2.27% |
| Prompt Engineering | 16 | 2.14% |
| RAG | 15 | 2.00% |
| In-Context Learning | 15 | 2.00% |