Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning tunes each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model’s parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):
$$ \theta_{t} = \theta_{t-1} - \eta\cdot\nabla_{\theta}J\left(\theta\right)$$
where $\eta$ is the learning rate and $\nabla_{\theta}J\left(\theta\right)$ is the gradient with respect to the model’s objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta^{1}, \ldots, \theta^{L}\}$, where $\theta^{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta^{1}, \ldots, \eta^{L}\}$, where $\eta^{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:
$$ \theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l}\cdot\nabla_{\theta^{l}}J\left(\theta\right) $$
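The per-layer update above can be sketched in plain Python (a minimal illustration, not the authors' code; the function name is hypothetical, and each layer's parameters and gradients are represented as lists of floats):

```python
# Minimal sketch of the discriminative SGD step:
# theta^l_t = theta^l_{t-1} - eta^l * grad^l, applied layer by layer.
# Hypothetical representation: params, grads are lists (one entry per
# layer) of lists of floats; lrs holds one learning rate per layer.

def discriminative_sgd_step(params, grads, lrs):
    """Apply one SGD update with a separate learning rate per layer."""
    assert len(params) == len(grads) == len(lrs)
    return [
        [p - lr * g for p, g in zip(layer_p, layer_g)]
        for layer_p, layer_g, lr in zip(params, grads, lrs)
    ]
```

In a framework like PyTorch, the same effect is usually achieved by passing one optimizer parameter group per layer, each with its own `lr`.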
The authors found empirically that it works well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then use $\eta^{l-1}=\eta^{l}/2.6$ as the learning rate for lower layers.
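The geometric schedule above can be written out directly (a sketch under the stated heuristic; the function name and default divisor argument are illustrative):

```python
# Sketch of the layer-wise learning-rate heuristic: starting from the
# last layer's rate eta_last, each lower layer's rate is divided by 2.6,
# i.e. eta^{l-1} = eta^{l} / divisor. Returns rates lowest layer first.

def layer_lrs(eta_last, num_layers, divisor=2.6):
    """Return [eta^1, ..., eta^L] with eta^{l-1} = eta^{l} / divisor."""
    lrs = [eta_last]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / divisor)
    return lrs[::-1]
```

For example, `layer_lrs(0.01, 3)` yields rates that increase by a factor of 2.6 from the lowest layer up to the last.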
Source: Universal Language Model Fine-tuning for Text Classification

| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 113 | 16.57% |
| Text Generation | 53 | 7.77% |
| Question Answering | 20 | 2.93% |
| Large Language Model | 19 | 2.79% |
| Retrieval | 15 | 2.20% |
| Decision Making | 14 | 2.05% |
| Text Classification | 13 | 1.91% |
| Prompt Engineering | 12 | 1.76% |
| Natural Language Understanding | 12 | 1.76% |