SGD Can Converge to Local Maxima

ICLR 2022 · Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda

Stochastic gradient descent (SGD) is widely used for the nonlinear, nonconvex problem of training deep neural networks, but its behavior remains poorly understood. Many theoretical works have studied SGD, but they commonly rely on restrictive and unrealistic assumptions about the nature of its noise. In this work, we construct example optimization problems illustrating that, if these assumptions are relaxed, SGD can exhibit many strange behaviors that run counter to the established wisdom of the field. Our constructions show that (1) SGD can converge to local maxima, (2) SGD might only escape saddle points arbitrarily slowly, (3) SGD can prefer sharp minima over flat ones, and (4) AMSGrad can converge to local maxima. We realize our most surprising results in a simple neural network-like construction, suggesting their relevance to deep learning.
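To give a feel for the kind of phenomenon described in point (1), here is a minimal sketch (not the authors' exact construction; the per-sample losses, coefficients, and step size below are illustrative assumptions). Each step samples a quadratic per-sample loss whose curvature is negative in expectation, so the expected loss has a local maximum at zero, yet plain SGD is still drawn to that point because its gradient noise is multiplicative and vanishes there, violating the usual non-degenerate-noise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy problem (hypothetical, not the paper's construction):
# per-sample losses l_a(x) = 0.5 * a * x**2, with curvature a drawn i.i.d.
coeffs = np.array([20.0, -1.0])   # possible per-sample curvatures a
probs = np.array([0.04, 0.96])    # sampling probabilities

eta = 0.048                       # learning rate (illustrative choice)

# Expected loss is 0.5 * E[a] * x**2; since E[a] < 0, x = 0 is a local MAXIMUM.
print("E[a] =", np.dot(probs, coeffs))
# Yet the SGD iterate x <- x * (1 - eta * a) contracts on average in log scale:
print("E[log|1 - eta*a|] =", np.dot(probs, np.log(np.abs(1 - eta * coeffs))))

x = 1.0
for t in range(1, 5001):
    a = rng.choice(coeffs, p=probs)   # sample one "data point"
    x -= eta * a * x                  # SGD step on the sampled per-sample loss
    if t % 1000 == 0:
        print(f"step {t:5d}   |x| = {abs(x):.3e}")

# |x| -> 0: SGD converges to the local maximum of the expected loss, because
# the noise is state-dependent (it vanishes at x = 0) rather than additive
# with variance bounded below, as many analyses assume.
```

The contraction happens whenever E[log|1 - eta*a|] < 0, which can hold even while E[a] < 0; this interplay between the sign of the expected gradient and the multiplicative noise is the kind of assumption violation the abstract refers to.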
