AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. To see the difference, note that $L_{2}$ regularization in Adam is usually implemented with the following modification, where $w_{t}$ is the weight decay rate at time $t$:
$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$
while AdamW decouples the weight decay term from the gradient and applies it directly in the parameter update:
$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \forall{t}$$
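To make the difference concrete, below is a minimal NumPy sketch of a single optimizer step that contrasts the two variants. The function name `adamw_step`, the `decoupled` flag, and the hyperparameter defaults are illustrative assumptions, not the reference implementation from the paper.

```python
# Minimal sketch: Adam with L2 regularization vs. AdamW's decoupled weight decay.
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One optimizer step. decoupled=True -> AdamW, False -> Adam + L2."""
    if not decoupled:
        # Adam with L2 regularization: the decay term is folded into the
        # gradient, so it also gets rescaled by the adaptive denominator below.
        grad = grad + weight_decay * theta

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    update = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: weight decay is applied directly to the parameters,
        # outside the adaptive gradient update.
        update = update + weight_decay * theta

    theta = theta - lr * update
    return theta, m, v

# Usage: one step on a toy quadratic loss f(theta) = 0.5 * ||theta||^2,
# whose gradient is simply theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta, m, v = adamw_step(theta, grad=theta.copy(), m=m, v=v, t=1)
print(theta)
```

With `decoupled=False` the decay term is divided by $\sqrt{\hat{v}_{t}} + \epsilon$ along with the gradient, so parameters with large gradient magnitudes are regularized less; with `decoupled=True` every parameter is decayed at the same rate, which is the behaviour AdamW is designed to restore.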
Source: Decoupled Weight Decay Regularization
Task | Papers | Share
---|---|---
Language Modelling | 19 | 6.62%
Image Classification | 15 | 5.23%
Decoder | 11 | 3.83%
Sentence | 10 | 3.48%
Document Classification | 10 | 3.48%
Question Answering | 10 | 3.48%
Text Classification | 7 | 2.44%
Object Detection | 7 | 2.44%
Deep Learning | 7 | 2.44%