On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

1 Feb 2024  ·  Huan Li, Zhouchen Lin ·

Although adaptive gradient methods have been extensively used in deep learning, their convergence rates proved in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{T^{1/4}})$ measured by $\ell_1$ norm without the bounded gradient assumption, where $d$ is the dimension of the optimization variable, $T$ is the iteration number, and $C$ is a constant identical to that appeared in the optimal convergence rate of SGD. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{T^{1/4}})$ rate of SGD in the ideal case of $\|\nabla f(x)\|_1=\varTheta(\sqrt{d}\|\nabla f(x)\|_2)$.

PDF Abstract
No code implementations yet. Submit your code now

Tasks


Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods