Stochastic Optimization

Adaptive Smooth Optimizer

Introduced by Lu in AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio

AdaSmooth is a stochastic optimization technique that provides a per-dimension learning rate for SGD. It is an extension of AdaGrad and AdaDelta that seeks to reduce AdaGrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$, while AdaSmooth adaptively selects the size of the window.

Given the window size $M$, the effective ratio is calculated by

$$e_t = \frac{s_t}{n_t} = \frac{| x_t - x_{t-M}|}{\sum_{i=0}^{M-1} | x_{t-i} - x_{t-1-i}|} = \frac{| \sum_{i=0}^{M-1} \Delta x_{t-1-i}|}{\sum_{i=0}^{M-1} | \Delta x_{t-1-i}|}.$$
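As a rough illustration, the per-dimension effective ratio can be computed from a buffer of the last $M$ parameter updates. The following is a minimal NumPy sketch; the `eps` term, added to avoid division by zero before any movement has occurred, is not part of the formula above.

```python
import numpy as np

def effective_ratio(deltas, eps=1e-8):
    """Element-wise e_t = |sum of last M updates| / sum of |last M updates|.

    `deltas` is a list of the last M parameter updates (Delta x_{t-1}, ..., Delta x_{t-M}),
    each a NumPy array with the same shape as the parameters.
    """
    deltas = np.stack(deltas)                 # shape: (M, *parameter_shape)
    net_move = np.abs(deltas.sum(axis=0))     # |x_t - x_{t-M}|
    total_move = np.abs(deltas).sum(axis=0)   # sum of per-step movements
    return net_move / (total_move + eps)      # values in [0, 1]
```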

Given the effective ratio, the scaled smoothing constant is obtained by:

$$c_t = (\rho_2 - \rho_1) \times e_t + (1-\rho_2).$$

The running average $E\left[g^{2}\right]_{t}$ at time step $t$ then depends only on the previous average and current gradient:

$$ E\left[g^{2}\right]_{t} = c_t^2 \odot g_{t}^2 + \left(1-c_t^2 \right)\odot E[g^2]_{t-1} $$
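A corresponding sketch of the scaled smoothing constant and the running average, reusing `effective_ratio` from above (function and variable names are illustrative, not from a reference implementation):

```python
def smoothed_second_moment(E_g2_prev, grad, e_t, rho1=0.5, rho2=0.99):
    """One update of E[g^2]_t using the scaled smoothing constant c_t."""
    c_t = (rho2 - rho1) * e_t + (1.0 - rho2)  # c_t lies in [1 - rho2, 1 - rho1]
    c2 = c_t ** 2
    return c2 * grad ** 2 + (1.0 - c2) * E_g2_prev
```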

Usually $\rho_1$ is set to around $0.5$ and $\rho_2$ to around $0.99$. The update step then follows:

$$ \Delta x_t = -\frac{\eta}{\sqrt{E\left[g^{2}\right]_{t} + \epsilon}} \odot g_{t}, $$

which is incorporated into the final update:

$$x_{t+1} = x_{t} + \Delta x_t.$$
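Putting the pieces together, one step of an AdaSmooth-style update might look like the sketch below, which assumes the two helpers above. The window size $M$, learning rate $\eta$, and $\epsilon$ defaults are illustrative choices, and the handling of the first few steps (before $M$ updates exist) is a simplification rather than the paper's prescription.

```python
from collections import deque
import numpy as np

class AdaSmoothSketch:
    def __init__(self, shape, M=10, eta=0.01, rho1=0.5, rho2=0.99, eps=1e-8):
        self.eta, self.rho1, self.rho2, self.eps = eta, rho1, rho2, eps
        self.E_g2 = np.zeros(shape)        # running average of squared gradients
        self.deltas = deque(maxlen=M)      # buffer of the last M parameter updates

    def step(self, x, grad):
        if self.deltas:
            e_t = effective_ratio(list(self.deltas), self.eps)
        else:
            # No history yet: e_t = 0 gives the smallest smoothing constant (1 - rho2).
            e_t = np.zeros_like(x)
        self.E_g2 = smoothed_second_moment(self.E_g2, grad, e_t,
                                           self.rho1, self.rho2)
        delta = -self.eta / np.sqrt(self.E_g2 + self.eps) * grad
        self.deltas.append(delta)
        return x + delta
```

A training loop would then repeatedly call `x = opt.step(x, grad)` with the gradient evaluated at the current parameters.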

The main advantages of AdaSmooth are its faster convergence and its reduced sensitivity to hyperparameters, since the accumulation window is selected adaptively rather than hand-tuned.

Source: AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio
