Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(n m)$ to $O(n + m)$.
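The factored accumulator can be sketched as follows (a minimal NumPy illustration, not the authors' reference code; the function names are ours). Only the row-sum vector $R$ and column-sum vector $C$ are stored, and the rank-1 reconstruction $\hat{V} = R C^{\top} / \mathbf{1}^{\top} R$ recovers the full accumulator approximation:

```python
import numpy as np

def update_factored_accumulator(R, C, G, beta2):
    """EMA of squared gradients in factored form (sketch).
    R: row sums, shape (n,); C: column sums, shape (m,); G: gradient (n, m)."""
    G2 = G * G
    R = beta2 * R + (1 - beta2) * G2.sum(axis=1)  # O(n) storage
    C = beta2 * C + (1 - beta2) * G2.sum(axis=0)  # O(m) storage
    return R, C

def reconstruct(R, C):
    """Rank-1 approximation V_hat = R C^T / sum(R): the nonnegative
    factorization that minimizes the generalized KL divergence."""
    return np.outer(R, C) / R.sum()
```

When the true squared-gradient accumulator happens to be rank-1 (e.g. the gradient is an outer product), the reconstruction is exact; otherwise it is the KL-optimal rank-1 surrogate used in place of Adam's full $n \times m$ second-moment matrix.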
Instead of defining the optimization algorithm in terms of absolute step sizes {$\alpha_t$}$_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes {$\rho_t$}$_{t=1}^T$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0.
Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d=1$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2_{t}} = 1 - t^{-0.8}$.
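Combining the relative-step-size rule with the proposed schedule, the absolute step size is $\alpha_t = \max(\epsilon_2, \mathrm{RMS}(X_{t-1}))\,\rho_t$. A small sketch under those hyperparameters (the function name is ours; $d$ governs the paper's separate update-clipping step and is not used here):

```python
import numpy as np

def relative_step_size(X, t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(X)) * rho_t, with the proposed schedule
    rho_t = min(1e-2, 1/sqrt(t)). Sketch, not reference code."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    rms = np.sqrt(np.mean(X * X))          # scale of the parameter tensor
    return max(eps2, rms) * rho_t          # eps2 lets zero-init params move
```

Note that for a zero-initialized parameter, RMS is 0, so the step size bottoms out at $\epsilon_2 \rho_t$ rather than 0, which is exactly what lets such parameters escape the origin.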
Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Task                             Papers   Share
Question Answering                   38   7.92%
Language Modelling                   37   7.71%
Text Generation                      32   6.67%
Machine Translation                  17   3.54%
Natural Language Understanding       16   3.33%
Abstractive Text Summarization       15   3.13%
Pretrained Language Models           13   2.71%
Semantic Parsing                     12   2.50%
Code Generation                       9   1.88%