AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. In Adam, $L_{2}$ regularization is usually implemented by adding the decay term to the gradient, where $w_{t}$ is the weight decay rate at time $t$:
$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$
while AdamW instead moves the weight decay term out of the gradient and applies it directly in the parameter update:
$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \forall{t}$$
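The decoupling above can be sketched as a single update step. This is a minimal NumPy sketch under the stated equations, not the paper's reference implementation; the function name and hyperparameter defaults are illustrative:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step (illustrative sketch).

    Note that ``grad`` is the raw loss gradient: the decay term is
    NOT added to it (that would be Adam with L2 regularization).
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term sits outside the
    # adaptive 1/(sqrt(v_hat) + eps) scaling applied to m_hat.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v
```

Because the decay term bypasses the adaptive scaling, every weight is shrunk at the same relative rate regardless of its gradient history, which is the key difference from $L_2$ regularization inside Adam.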
Source: Decoupled Weight Decay Regularization

| Task | Papers | Share |
| --- | --- | --- |
| Question Answering | 7 | 7.69% |
| Image Classification | 5 | 5.49% |
| Language Modelling | 5 | 5.49% |
| Natural Language Inference | 4 | 4.40% |
| Abstractive Text Summarization | 4 | 4.40% |
| General Classification | 4 | 4.40% |
| Document Classification | 3 | 3.30% |
| Document Summarization | 3 | 3.30% |
| Object Detection | 3 | 3.30% |