RMSprop converges with proper hyperparameter

ICLR 2021  ·  Naichen Shi, Dawei Li, Mingyi Hong, Ruoyu Sun

Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with a proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter $\beta_2$ is large enough, the random-shuffling version of RMSprop converges to a bounded region in general, and to a stationary point in the interpolation regime. It is worth mentioning that our results do not rely on the "bounded gradient" assumption, which is often the key assumption in existing theoretical work on RMSprop. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that there is a critical threshold in practice, such that RMSprop generates reasonably good results only if $\beta_2 \ge {\sf th}$. We provide empirical evidence of such a phase transition in our numerical experiments.
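For reference, below is a minimal sketch of the random-shuffling RMSprop iteration the abstract refers to, written in a generic per-coordinate form. The step size `lr` and stability constant `eps` are standard RMSprop notation rather than quantities from the paper, and `grad_fn` is a hypothetical helper for per-sample gradients; $\beta_2$ appears as `beta2`.

```python
import numpy as np

def rmsprop_random_shuffle(w, grad_fn, n_samples, epochs=10,
                           lr=1e-3, beta2=0.999, eps=1e-8):
    """Random-shuffling RMSprop sketch: each epoch visits every sample
    exactly once in a fresh random order. grad_fn(w, i) should return
    the gradient of the i-th sample's loss at w (hypothetical helper).
    The paper's convergence result concerns sufficiently large beta2.
    """
    v = np.zeros_like(w)  # running second-moment estimate
    for _ in range(epochs):
        for i in np.random.permutation(n_samples):
            g = grad_fn(w, i)
            v = beta2 * v + (1 - beta2) * g**2    # EMA of squared gradients
            w = w - lr * g / (np.sqrt(v) + eps)   # per-coordinate scaled step
    return w
```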
