SGD Can Converge to Local Maxima
Stochastic gradient descent (SGD) is widely used for the nonlinear, nonconvex problem of training deep neural networks, but its behavior remains poorly understood. Many theoretical works have studied SGD, but they commonly rely on restrictive and unrealistic assumptions about the nature of its noise. In this work, we construct example optimization problems illustrating that, if these assumptions are relaxed, SGD can exhibit many strange behaviors that run counter to the established wisdom of the field. Our constructions show that (1) SGD can converge to local maxima, (2) SGD might only escape saddle points arbitrarily slowly, (3) SGD can prefer sharp minima over flat ones, and (4) AMSGrad can converge to local maxima. We realize our most surprising results in a simple neural network-like construction, suggesting their relevance to deep learning.
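To make the flavor of such results concrete, below is a minimal illustrative sketch of how state-dependent, multiplicative gradient noise can pull SGD toward a maximum. The objective, constants, and noise distribution here are assumptions chosen for illustration only; they are not taken from the paper's constructions.

```python
import numpy as np

# Toy illustration (not the paper's construction): minimize f(x) = -x^2 / 2,
# whose only stationary point x = 0 is a maximum. The stochastic gradient
# g(x) = c * f'(x) is unbiased (E[c] = 1), but its noise is state-dependent
# and heavy enough that the update x <- x * (1 + lr * c) contracts on average
# in log scale, so the iterates are attracted to the maximum.

rng = np.random.default_rng(0)

def stochastic_grad(x):
    # c is 18.91 with prob. 0.1 and -0.99 with prob. 0.9, so E[c] = 1
    # (unbiased), yet E[log|1 + lr * c|] < 0 for lr = 1.
    c = 18.91 if rng.random() < 0.1 else -0.99
    return c * (-x)  # true gradient of f(x) = -x^2/2 is -x

lr = 1.0
x = 1.0  # start away from the maximum at x = 0
for _ in range(100):
    x = x - lr * stochastic_grad(x)

print(f"final |x| = {abs(x):.3e}")  # typically tiny: SGD has approached the maximum
```

The mechanism is that the noise vanishes at x = 0 and the log of the per-step multiplier has negative mean, so the iterates shrink toward the stationary point almost surely even though the gradient estimate is unbiased; this violates the bounded-below, state-independent noise assumptions common in SGD analyses.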