Stochastic Optimization

Introduced by Zhou et al. in AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

AdaShift is an adaptive stochastic optimizer that decorrelates $v_{t}$ and $g_{t}$ in Adam by temporal shifting, i.e., by using the temporally shifted gradient $g_{t-n}$ to compute $v_{t}$. The authors argue that Adam contains an inappropriate correlation between the gradient $g_{t}$ and the second-moment term $v_{t}$: a large gradient tends to receive a small step size, while a small gradient may receive a large one. They identify these biased step sizes as the fundamental cause of Adam's non-convergence.

$$g_{t} = \nabla{f_{t}}\left(\theta_{t}\right)$$

$$m_{t} = \sum^{n-1}_{i=0}\beta^{i}_{1}g_{t-i}/\sum^{n-1}_{i=0}\beta^{i}_{1}$$

Then, for each parameter block $i = 1$ to $M$, where $\phi$ is a spatial operation over the block (e.g., the maximum):

$$v_{t}\left[i\right] = \beta_{2}v_{t-1}\left[i\right] + \left(1-\beta_{2}\right)\phi\left(g^{2}_{t-n}\left[i\right]\right)$$

$$\theta_{t}\left[i\right] = \theta_{t-1}\left[i\right] - \alpha_{t}/\sqrt{v_{t}\left[i\right]}\cdot{m_{t}\left[i\right]}$$
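The update rules above can be sketched as follows; this is a minimal NumPy illustration, assuming a single parameter block ($M = 1$) and $\phi = \max$ over the block. The function name, hyperparameter defaults, warm-up handling, and the $\epsilon$ in the denominator are illustrative choices, not prescribed by the paper:

```python
import numpy as np
from collections import deque

def adashift(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999,
             n=10, steps=300, eps=1e-8):
    """Minimal AdaShift sketch: one parameter block, phi = max.

    grad_fn(theta) -> gradient. The last n+1 gradients are buffered so
    the second moment v_t is built from the shifted gradient g_{t-n},
    decorrelating it from the first moment m_t, which averages
    g_t ... g_{t-n+1} with weights beta1^i.
    """
    theta = np.asarray(theta0, dtype=float)
    v = 0.0
    buf = deque(maxlen=n + 1)            # holds g_{t-n} ... g_t, newest first
    w = beta1 ** np.arange(n)            # weights beta1^0 ... beta1^{n-1}
    w /= w.sum()                         # normalize, as in the m_t formula
    for t in range(steps):
        buf.appendleft(grad_fn(theta))   # buf[0] = g_t, buf[-1] = oldest
        if len(buf) < n + 1:
            continue                     # warm up until g_{t-n} exists
        # m_t: beta1-weighted average of the n most recent gradients
        m = sum(wi * gi for wi, gi in zip(w, list(buf)[:n]))
        g_shift = buf[-1]                # g_{t-n}
        # v_t: second moment from the shifted gradient, phi = max over block
        v = beta2 * v + (1 - beta2) * np.max(g_shift ** 2)
        # eps added for numerical safety (illustrative, not in the paper)
        theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta
```

Because $v_{t}$ depends only on $g_{t-n}$, which is independent of the gradients entering $m_{t}$, the step size for a given gradient is no longer systematically biased by that gradient's own magnitude.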
