AdaShift is an adaptive stochastic optimizer that decorrelates $v_{t}$ and $g_{t}$ in Adam by temporal shifting, i.e., by using the temporally shifted gradient $g_{t-n}$ to calculate $v_{t}$. The authors argue that an inappropriate correlation between the gradient $g_{t}$ and the second-moment term $v_{t}$ exists in Adam, so that a large gradient is likely to receive a small step size while a small gradient may receive a large step size. They argue that such biased step sizes are the fundamental cause of the non-convergence of Adam.
The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:
$$ g_{t} = \nabla{f_{t}}\left(\theta_{t}\right) $$
$$ m_{t} = \sum^{n-1}_{i=0}\beta^{i}_{1}g_{t-i}\Big/\sum^{n-1}_{i=0}\beta^{i}_{1} $$
Then for $i=1$ to $M$:
$$ v_{t}\left[i\right] = \beta_{2}v_{t-1}\left[i\right] + \left(1-\beta_{2}\right)\phi\left(g^{2}_{t-n}\left[i\right]\right) $$
$$ \theta_{t}\left[i\right] = \theta_{t-1}\left[i\right] - \alpha_{t}/\sqrt{v_{t}\left[i\right]}\cdot{m_{t}\left[i\right]} $$
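To make the shifted update concrete, below is a minimal NumPy sketch of a single AdaShift step. It treats the whole parameter vector as one block, uses `np.max` as the spatial function $\phi$, and adds a small `eps` for numerical stability; the function name, default hyperparameters, and warm-up handling are assumptions of this sketch, not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def adashift_step(theta, grad, grad_buffer, v,
                  alpha=0.01, beta1=0.9, beta2=0.999, n=10,
                  phi=np.max, eps=1e-8):
    """One AdaShift update (sketch; whole parameter treated as a single block).

    theta       : parameter vector
    grad        : current gradient g_t
    grad_buffer : deque of recent gradients, oldest first
    v           : second-moment accumulator
    phi         : spatial reduction over the squared shifted gradient (e.g. max)
    """
    grad_buffer.append(grad)
    if len(grad_buffer) <= n:
        # Not enough history to form g_{t-n} yet; skip the adaptive update
        # during warm-up (a choice made in this sketch).
        return theta, v

    # g_{t-n}: temporally shifted gradient, decorrelated from m_t
    shifted = grad_buffer.popleft()

    # m_t = sum_{i=0}^{n-1} beta1^i g_{t-i} / sum_{i=0}^{n-1} beta1^i
    # (deque now holds g_{t-n+1}, ..., g_t oldest first, so reverse the weights)
    weights = beta1 ** np.arange(len(grad_buffer))[::-1]
    m = sum(w * g for w, g in zip(weights, grad_buffer)) / weights.sum()

    # v_t = beta2 * v_{t-1} + (1 - beta2) * phi(g_{t-n}^2)
    v = beta2 * v + (1.0 - beta2) * phi(shifted ** 2)

    # theta_t = theta_{t-1} - alpha / sqrt(v_t) * m_t
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, v

# Usage sketch: keep `grad_buffer = deque()` and `v = 0.0` across iterations,
# calling adashift_step once per minibatch gradient.
```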
Source: AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
