Source: https://arxiv.org/abs/1901.11150
Adaptive gradientbased optimizers such as AdaGrad and Adam are among the defacto methods of choice in modern machine learning.These methods tune the learning rate for each parameter during the optimization process using cumulative secondorder statistics. These methods provide superior convergence properties and are very attractive in large scale applications due to their moderate time and space requirements which are linear in the number of parameters.
However, the recent advances in natural language processing such as BERT and GPT2 show that models with 10<sup>8</sup> to 10<sup>10</sup> parameters, trained with adaptive optimization methods, achieve stateoftheart results. In such cases, the memory overhead of the optimizer can restrict the size of the model that can be used as well as the batch size, both of which can have a dramatic effect on the quality of the final model.
Here we construct a new adaptive optimization method that retains most of the benefits of standard perparameter adaptivity while significantly reducing memory overhead.
We observe that in standard neural networks that certain entries of the stochastic gradients have (on average) similar values, and exhibit what we refer to as an activation pattern. For example, in gradients of embedding layers of deep networks, an entire row (or column) is either zero or nonzero. Similarly, in intermediate layers we often observe that gradients associated with the same unit are of similar order of magnitude. In these cases, a similar phenomenon is observed in the secondorder statistics maintained by adaptive methods. With this key observation, to reduce the memory overhead of the optimizer our method takes in a cover set of the parameters. Cover sets are typically selected in practice such that parameters in each of the sets have second order statistics of similar magnitude. Our method is general enough that it can easily be extended to arbitrary cover sets. For parameters of deep networks that are organized as a collection of tensors, we form a cover consisting of slices of codimension one for each tensor. Thus, for an m x n parameter matrix, the cover consists of rows and columns of the matrix. The memory requirements therefore drop from mxn to merely m+n. For a parameter tensor of rank p, with dimensions n<sub>1</sub> ... n<sub>p</sub>, the reduction in memory consumption is even more pronounced, dropping from product of all the dimensions to the sum of all dimensions. This virtually eliminates the memory overhead associated with maintaining the adaptive learning rates!
Another practical aspect worthy of note is that our method does not require an external hand engineered learning rate decay schedule but instead relies on the per parameter adaptivity that is natural to its update rule which makes it easier to tune. We provide details in the supplementary section of the paper.
learning_rate = lr_constant * tf.minimum(1.0, (warm_up_step / global_step) ** p)
We typically set warm_up_step
as 5% of overall steps. Initially, the norm of the preconditioned gradient is much larger than norm of the weights. Learning rate warmup allows us to heuristically fix this scale mismatch.
We make use accumulated gradient squares for the decay. This means that each coordinate gets its own natural decay based on the scales of the gradients over time. Hence, users need not put in an external learning rate decay schedule. Moreover, we found in our experiments with translation and language models that this approach is superior to a handtuned learning rate decay schedules which is typically combined with exponential moving averages of the gradient squares.
Having said that if users want to add exponential moving averages instead of the standard accumulated gradient squares  It's easy to modify the optimizer implementation to switch to exponential moving averages.
For rank > 1:
from  to 

current_accumulator += grad * grad  current_accumulator = beta * current_accumulator + (1beta) * grad * grad 
For rank <= 1:
from  to 

current_accumulator = tf.assign_add(accumulator, grad * grad)  current_accumulator = tf.assign(accumulator, beta * accumulator + (1beta) * (grad * grad)) 
It's useful to run polyak averaging of the parameters. These parameters are then used in inference / serving. Using the averaged parameters instead of the last iterate typically improves the overall performance of the model.
An alternative to polyak averaging which does not make use of extra memory is to decay the learning rate from the constant to zero for the last 10% of the steps of your entire training run, we term the phase a cooldown phase of the model. As training makes smaller and smaller steps the final iterate can be thought of as an average iterate.
Source: Memory Efficient Adaptive OptimizationPaper  Code  Results  Date  Stars 

Task  Papers  Share 

Language Modelling  2  33.33% 
Machine Translation  2  33.33% 
TextToSpeech Synthesis  1  16.67% 
Text Generation  1  16.67% 
Component  Type 


🤖 No Components Found  You can add them if they exist; e.g. Mask RCNN uses RoIAlign 