AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. To see the difference, note that $L_{2}$ regularization in Adam is usually implemented with the following modification, where $w_{t}$ is the weight decay rate at time $t$:
$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$
while AdamW decouples the weight decay term from the gradient and applies it directly in the parameter update:
$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot{\hat{m}_{t}} + w_{t, i}\theta_{t, i}\right), \forall{t}$$
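To make the difference concrete, below is a minimal NumPy sketch of a single optimizer step that contrasts the two variants. The function name `adamw_step`, the `decoupled` flag, and the hyperparameter defaults are illustrative assumptions, not the reference implementation from the paper.

```python
# Minimal sketch: Adam with L2 regularization vs. AdamW's decoupled weight decay.
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One optimizer step. decoupled=True -> AdamW, False -> Adam + L2."""
    if not decoupled:
        # Adam with L2 regularization: the decay term is folded into the
        # gradient, so it also gets rescaled by the adaptive denominator below.
        grad = grad + weight_decay * theta

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    update = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: weight decay is applied directly to the parameters,
        # outside the adaptive gradient update.
        update = update + weight_decay * theta

    theta = theta - lr * update
    return theta, m, v

# Usage: one step on a toy quadratic loss f(theta) = 0.5 * ||theta||^2,
# whose gradient is simply theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta, m, v = adamw_step(theta, grad=theta.copy(), m=m, v=v, t=1)
print(theta)
```

With `decoupled=False` the decay term is divided by $\sqrt{\hat{v}_{t}} + \epsilon$ along with the gradient, so parameters with large gradient magnitudes are regularized less; with `decoupled=True` every parameter is decayed at the same rate, which is the behaviour AdamW is designed to restore.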
Source: Decoupled Weight Decay Regularization
Task | Papers | Share
---|---|---
Language Modelling | 19 | 6.62%
Image Classification | 15 | 5.23%
Decoder | 11 | 3.83%
Sentence | 10 | 3.48%
Document Classification | 10 | 3.48%
Question Answering | 10 | 3.48%
Text Classification | 7 | 2.44%
Object Detection | 7 | 2.44%
Deep Learning | 7 | 2.44%