The Simpler the Better: Vanilla SGD Revisited

1 Jan 2021 · Yueyao Yu, Jie Wang, Wenye Li, Yin Zhang

The stochastic gradient descent (SGD) method, first proposed in the 1950s, has been the foundation of deep-neural-network (DNN) training, with numerous enhancements including adding momentum, adaptively selecting learning rates, or combining both strategies, among others. Conventional wisdom holds that the learning rate must eventually be made small in order to reach sufficiently good approximate solutions. Another widely held view is that vanilla SGD is out of fashion compared to many of its modern variants. In this work, we make the contrarian claim that, when training over-parameterized DNNs, vanilla SGD can still compete well with, and oftentimes outperform, its more recent variants simply by using learning rates significantly larger than commonly used values. We provide theoretical justification for this claim and present computational evidence in its support across multiple tasks, including image classification, speech recognition, and natural language processing.
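A minimal sketch of the setup the abstract describes, assuming PyTorch: plain SGD with no momentum and no adaptive learning rates, run with a learning rate larger than conventional defaults. The model, data, and the value lr=1.0 are illustrative placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn

# Toy over-parameterized model and synthetic data, for illustration only.
model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

# Vanilla SGD: momentum=0 (the default), constant learning rate.
# lr=1.0 is used purely as an example of exceeding the commonly
# used 0.01-0.1 range; it is not a value reported in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
criterion = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```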
