Stochastic Optimization

AdaShift is an adaptive stochastic optimizer that decorrelates $v_{t}$ and $g_{t}$ in Adam by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_{t}$. The authors argue that an inappropriate correlation between the gradient $g_{t}$ and the second-moment term $v_{t}$ exists in Adam, so that a large gradient tends to receive a small step size while a small gradient may receive a large one. They identify such biased step sizes as the fundamental cause of Adam's non-convergence.

The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:

$$ g_{t} = \nabla{f_{t}}\left(\theta_{t}\right) $$

$$ m_{t} = \sum^{n-1}_{i=0}\beta^{i}_{1}g_{t-i}/\sum^{n-1}_{i=0}\beta^{i}_{1} $$

Then, for each coordinate block $i=1$ to $M$:

$$ v_{t}\left[i\right] = \beta_{2}v_{t-1}\left[i\right] + \left(1-\beta_{2}\right)\phi\left(g^{2}_{t-n}\left[i\right]\right) $$

$$ \theta_{t}\left[i\right] = \theta_{t-1}\left[i\right] - \alpha_{t}/\sqrt{v_{t}\left[i\right]}\cdot{m_{t}\left[i\right]} $$
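The updates above can be sketched in NumPy. This is a minimal illustration of the update rules, not the authors' implementation: it keeps a buffer of the last $n+1$ gradients, updates $v_t$ with the oldest one ($g_{t-n}$), and averages the $n$ most recent gradients for $m_t$. The choice of $\phi$ (here `np.max`, one option discussed in the paper), the warm-up handling, and the added `eps` for numerical safety are assumptions.

```python
import numpy as np

class AdaShift:
    """Minimal sketch of the AdaShift update rules (illustrative, not the
    authors' code). v_t is driven by the shifted gradient g_{t-n}, which
    decorrelates it from the gradients used to form m_t."""

    def __init__(self, lr=0.01, n=10, beta1=0.9, beta2=0.999,
                 eps=1e-8, phi=np.max):
        self.lr, self.n = lr, n
        self.beta2, self.eps = beta2, eps
        self.phi = phi                      # spatial function phi, e.g. max
        self.grads = []                     # buffer of the last n+1 gradients
        self.v = None
        # normalized weights beta1^i / sum_j beta1^j for the first moment
        self.w = beta1 ** np.arange(n)
        self.w /= self.w.sum()

    def step(self, theta, grad):
        self.grads.append(grad)
        if len(self.grads) <= self.n:       # warm-up: g_{t-n} not yet available
            return theta
        g_shift = self.grads.pop(0)         # g_{t-n}
        if self.v is None:
            self.v = np.zeros_like(theta)
        # v_t = beta2 * v_{t-1} + (1 - beta2) * phi(g_{t-n}^2)
        self.v = self.beta2 * self.v + (1 - self.beta2) * self.phi(g_shift ** 2)
        # m_t: weighted average of the n most recent gradients (g_t first)
        m = sum(w * g for w, g in zip(self.w, reversed(self.grads)))
        # theta_t = theta_{t-1} - lr / sqrt(v_t) * m_t
        return theta - self.lr * m / (np.sqrt(self.v) + self.eps)
```

For example, minimizing $f(\theta)=\theta^2$ amounts to repeatedly calling `theta = opt.step(theta, 2 * theta)`; the parameter stays fixed during the first $n$ warm-up steps and then moves toward zero.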

Source: AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
