AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling the weight decay from the gradient update. In standard Adam, $L_{2}$ regularization is usually implemented by adding the decay term directly to the gradient, where $w_{t}$ is the weight decay rate at time $t$:
$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$
while AdamW removes the weight decay term from the gradient and instead applies it directly in the parameter update:
$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \quad \forall{t}$$
Source: Decoupled Weight Decay Regularization

| Task | Papers | Share |
| --- | --- | --- |
| Language Modelling | 15 | 7.39% |
| Image Classification | 10 | 4.93% |
| Document Classification | 10 | 4.93% |
| Question Answering | 10 | 4.93% |
| Sentence | 9 | 4.43% |
| Classification | 7 | 3.45% |
| Text Classification | 6 | 2.96% |
| Natural Language Inference | 6 | 2.96% |
| Object Detection | 6 | 2.96% |