Stochastic Optimization

Introduced by Loshchilov et al. in Decoupled Weight Decay Regularization

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update. To see this, $L_{2}$ regularization in Adam is usually implemented with the below modification where $w_{t}$ is the rate of the weight decay at time $t$:

$$g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$

$$\theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t} + \epsilon}}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \forall{t}$$

#### Papers

Paper Code Results Date Stars