Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared-gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, the method can reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n + m)$.
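The factored reconstruction can be sketched as follows. This is a minimal illustration (not the paper's reference code): for a squared-gradient matrix, only the row sums and column sums are stored, and their outer product divided by the total sum gives a rank-1 approximation that exactly preserves both marginals.

```python
import numpy as np

def factored_second_moment(G):
    """Rank-1 approximation of G**2 from its row and column sums.

    Stores only O(n + m) numbers for an n x m gradient matrix G,
    yet reproduces the row and column sums of G**2 exactly.
    """
    G2 = G ** 2
    R = G2.sum(axis=1, keepdims=True)  # shape (n, 1): row sums
    C = G2.sum(axis=0, keepdims=True)  # shape (1, m): column sums
    return R @ C / R.sum()             # rank-1 reconstruction of G2

G = np.array([[1.0, 2.0],
              [3.0, 4.0]])
V = factored_second_moment(G)
# V has the same row sums and column sums as G**2.
```

In the full algorithm these row and column statistics are themselves exponentially smoothed across steps, rather than recomputed from a single gradient as above.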
Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes $\{\rho_t\}_{t=1}^T$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0.
Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d = 1$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2_{t}} = 1 - t^{-0.8}$.
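The relative step size with the proposed schedule can be sketched as below. This is my own illustration, not the paper's code: the absolute step at time $t$ is the parameter scale (RMS, floored at $\epsilon_2$) times $\rho_t$.

```python
import numpy as np

def relative_step_size(X, t, eps2=1e-3):
    """Absolute step size = max(eps2, RMS(X)) * rho_t.

    Uses the proposed schedule rho_t = min(1e-2, 1/sqrt(t)).
    The eps2 floor keeps the step nonzero even when X is all zeros,
    letting zero-initialized parameters escape 0.
    """
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    rms = np.sqrt(np.mean(X ** 2))  # scale of the parameter tensor
    return max(eps2, rms) * rho_t

X = np.zeros((4, 4))                 # zero-initialized parameters
alpha = relative_step_size(X, t=1)   # floor gives 1e-3 * 1e-2 = 1e-5
```

Because the step is proportional to the parameter scale, large parameter tensors take proportionally larger steps without any per-tensor learning-rate tuning.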
Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Task | Papers | Share
--- | --- | ---
Language Modelling | 100 | 9.12%
Question Answering | 55 | 5.01%
Decoder | 49 | 4.47%
Sentence | 43 | 3.92%
Text Generation | 42 | 3.83%
Retrieval | 36 | 3.28%
Translation | 29 | 2.64%
Machine Translation | 24 | 2.19%
Natural Language Understanding | 19 | 1.73%