Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):
$$ \theta_{t} = \theta_{t-1} - \eta\cdot\nabla_{\theta}J\left(\theta\right)$$
where $\eta$ is the learning rate and $\nabla_{\theta}J\left(\theta\right)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta^{1}, \ldots, \theta^{L}\}$, where $\theta^{l}$ contains the parameters of the model at the $l$th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta^{1}, \ldots, \eta^{L}\}$, where $\eta^{l}$ is the learning rate of the $l$th layer. The SGD update with discriminative fine-tuning is then:
$$ \theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l}\cdot\nabla_{\theta^{l}}J\left(\theta\right) $$
The authors find that empirically it works well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then to use $\eta^{l-1}=\eta^{l}/2.6$ as the learning rate for the lower layers.
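As a minimal sketch of the two pieces above, the following plain-Python helpers (hypothetical names, not from the paper's code) build the geometrically decaying learning-rate schedule $\eta^{l-1}=\eta^{l}/2.6$ and apply the per-layer SGD update $\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l}\cdot\nabla_{\theta^{l}}J(\theta)$, treating each layer's parameters and gradients as flat lists of floats:

```python
def discriminative_lrs(eta_last, num_layers, decay=2.6):
    """Per-layer learning rates for discriminative fine-tuning.

    Starts from the last layer's rate eta_last and applies
    eta^{l-1} = eta^l / decay going down the stack.
    Returns a list indexed by layer: index 0 is the lowest layer,
    index num_layers - 1 is the last (top) layer.
    """
    lrs = [eta_last]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / decay)
    return list(reversed(lrs))


def discriminative_sgd_step(params_per_layer, grads_per_layer, lrs):
    """One SGD step with a distinct learning rate per layer:
    theta_t^l = theta_{t-1}^l - eta^l * grad^l.
    """
    return [
        [p - lr * g for p, g in zip(params, grads)]
        for params, grads, lr in zip(params_per_layer, grads_per_layer, lrs)
    ]
```

In a deep-learning framework, the same idea is usually expressed by giving the optimizer one parameter group per layer, each with its own learning rate, rather than updating lists by hand.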
Source: Universal Language Model Fine-tuning for Text Classification

| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 164 | 19.78% |
| Text Generation | 90 | 10.86% |
| Question Answering | 30 | 3.62% |
| Pretrained Language Models | 23 | 2.77% |
| Text Classification | 22 | 2.65% |
| General Classification | 19 | 2.29% |
| Sentiment Analysis | 18 | 2.17% |
| Machine Translation | 16 | 1.93% |
| Knowledge Distillation | 14 | 1.69% |