Stochastic Optimization


Introduced by Loshchilov et al. in Decoupled Weight Decay Regularization

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update. To see this, $L_{2}$ regularization in Adam is usually implemented with the below modification where $w_{t}$ is the rate of the weight decay at time $t$:

$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$

while AdamW adjusts the weight decay term to appear in the gradient update:

$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t} + \epsilon}}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \forall{t}$$

Source: Decoupled Weight Decay Regularization


Paper Code Results Date Stars



Component Type
🤖 No Components Found You can add them if they exist; e.g. Mask R-CNN uses RoIAlign