Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n + m)$.
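The factored accumulator described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and signature are invented for clarity, and only the matrix case is shown.

```python
import numpy as np

def update_factored_accumulator(R, C, grad, beta2, eps1=1e-30):
    """Update the factored second-moment statistics for a matrix parameter.

    R : (n,) moving average of row sums of the squared gradient
    C : (m,) moving average of column sums of the squared gradient
    Returns the updated (R, C) and the rank-1 reconstruction V_hat.
    """
    sq = grad ** 2 + eps1                              # regularized squared gradient
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)       # row sums, O(n) memory
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)       # column sums, O(m) memory
    # Rank-1 reconstruction; this factorization minimizes the generalized
    # KL divergence to the full accumulator among nonnegative rank-1 matrices.
    V_hat = np.outer(R, C) / R.sum()
    return R, C, V_hat
```

Note that only the $n + m$ numbers in `R` and `C` persist between steps; the $n \times m$ matrix `V_hat` is reconstructed on the fly, which is the source of the sublinear memory cost.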
Instead of defining the optimization algorithm in terms of absolute step sizes {$\alpha_t$}$_{t=1}^T$, the authors define it in terms of relative step sizes {$\rho_t$}$_{t=1}^T$, which are multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. This lower bound allows zero-initialized parameters to escape 0.
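A small sketch of this scaling rule, with a hypothetical function name:

```python
import numpy as np

def absolute_step(param, rho_t, eps2=1e-3):
    """Convert a relative step size rho_t into an absolute one.

    The scale is the RMS of the parameter's entries, floored at eps2 so that
    zero-initialized parameters still receive a nonzero step.
    """
    rms = np.sqrt(np.mean(param ** 2))
    return max(eps2, rms) * rho_t
```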
Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d = 1$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2t} = 1 - t^{-0.8}$.
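Putting the pieces together, one Adafactor update for a matrix parameter with these hyperparameter schedules might look like the following. This is an illustrative sketch of the core update (factored accumulator, update clipping with threshold $d$, relative step size), not a faithful reimplementation of the paper's full algorithm.

```python
import numpy as np

def adafactor_step(W, grad, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One illustrative Adafactor update for a matrix parameter W at step t >= 1."""
    beta2 = 1.0 - t ** (-0.8)                 # decay schedule: beta2_hat_t = 1 - t^(-0.8)
    rho = min(1e-2, 1.0 / np.sqrt(t))         # relative step size schedule
    sq = grad ** 2 + eps1
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)   # factored row statistics
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)   # factored column statistics
    V = np.outer(R, C) / R.sum()                   # rank-1 reconstruction
    U = grad / np.sqrt(V)                          # adaptively scaled update
    U /= max(1.0, np.sqrt(np.mean(U ** 2)) / d)    # clip update RMS at threshold d
    alpha = max(eps2, np.sqrt(np.mean(W ** 2))) * rho  # relative -> absolute step
    return W - alpha * U, R, C
```

At $t = 1$ the decay rate $\hat{\beta}_{21} = 0$, so the accumulators are seeded directly from the first squared gradient without needing separate bias correction.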
Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

| Task | Papers | Share |
|---|---|---|
| Language Modelling | 91 | 9.43% |
| Question Answering | 69 | 7.15% |
| Text Generation | 49 | 5.08% |
| Retrieval | 27 | 2.80% |
| Machine Translation | 25 | 2.59% |
| Natural Language Understanding | 23 | 2.38% |
| Abstractive Text Summarization | 20 | 2.07% |
| Semantic Parsing | 20 | 2.07% |
| Natural Language Inference | 17 | 1.76% |