15 Sep 2023 • Alokendu Mazumder, Rishabh Sabharwal, Manan Tayal, Bhartendu Kumar, Punit Rathore
Lastly, (iii) we demonstrate that our derived constant step size is more effective at reducing gradient norms, and we show empirically that, despite the accumulation of a few past gradients, the key driver of convergence in Adam is non-increasing step sizes.
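To make the quantity being measured concrete, the following is a minimal sketch of the standard Adam update run with a fixed step size while recording the gradient norm at each iteration. It is not the authors' implementation: the value `alpha=0.01` is an arbitrary placeholder (the derived constant step size is not stated in this excerpt), and the toy quadratic objective and the names `adam_constant_step` and `quadratic_grad` are assumptions made purely for illustration.

```python
import math


def adam_constant_step(grad_fn, x0, alpha, steps,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam with a fixed step size; return the final iterate and gradient norms."""
    x = list(x0)
    m = [0.0] * len(x)  # first-moment estimate (running mean of past gradients)
    v = [0.0] * len(x)  # second-moment estimate (running mean of squared gradients)
    norms = []
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # Record the Euclidean gradient norm before taking the step.
        norms.append(math.sqrt(sum(gi * gi for gi in g)))
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            # alpha stays constant across iterations (no decay schedule).
            x[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, norms


def quadratic_grad(x):
    # Gradient of the hypothetical toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    return [x[0], 10.0 * x[1]]


# alpha=0.01 is a placeholder, NOT the constant step size derived in the paper.
x_final, norms = adam_constant_step(quadratic_grad, [1.0, 1.0], alpha=0.01, steps=2000)
print(f"initial gradient norm: {norms[0]:.4f}")
print(f"final gradient norm:   {norms[-1]:.6f}")
```

Comparing the recorded `norms` across different choices of `alpha`, or against a decaying schedule, is one simple way to examine the kind of gradient-norm behaviour the abstract refers to.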