Amortized Nesterov's Momentum: Robust and Lightweight Momentum for Deep Learning

25 Sep 2019  ·  Kaiwen Zhou, Yanghua Jin, Qinghua Ding, James Cheng

Stochastic Gradient Descent (SGD) with Nesterov's momentum is a widely used optimizer in deep learning and is observed to have excellent generalization performance. However, due to its large stochasticity, SGD with Nesterov's momentum is not robust, i.e., its performance may deviate significantly from expectation. In this work, we propose Amortized Nesterov's Momentum, a special variant of Nesterov's momentum that has more robust iterates, faster convergence in the early stage, and higher efficiency. Our experimental results show that this new momentum achieves similar (and sometimes better) generalization performance with little-to-no tuning. In the convex case, we provide optimal convergence rates for our new methods and discuss how the theorems explain the empirical results.
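
The abstract does not spell out the update rule, so the following is a minimal, self-contained sketch of one plausible reading of the idea, not the paper's exact pseudocode: take plain SGD steps every iteration and add a momentum correction, measured against a snapshot of the iterate from m steps ago, only once every m iterations. The function name amortized_momentum_sgd, the noisy quadratic test problem, and all hyperparameter values below are illustrative assumptions and are not taken from the paper.

import numpy as np

# Sketch of SGD with an "amortized" momentum correction (illustrative, not the
# paper's algorithm): plain SGD steps every iteration, plus a momentum term
# based on the displacement since the last snapshot, applied every m steps.
def amortized_momentum_sgd(grad, x0, lr=0.1, beta=0.9, m=5, num_iters=500):
    x = np.array(x0, dtype=float)
    snapshot = x.copy()                    # iterate recorded m steps ago
    for k in range(num_iters):
        x = x - lr * grad(x)               # plain stochastic gradient step
        if (k + 1) % m == 0:
            x = x + beta * (x - snapshot)  # amortized momentum over the last m steps
            snapshot = x.copy()            # refresh the snapshot
    return x

# Hypothetical usage: a noisy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = amortized_momentum_sgd(noisy_grad, x0=np.ones(10), lr=0.1, beta=0.9, m=5)

Compared with standard Nesterov momentum, which applies a momentum correction at every iteration, this style of update touches the momentum state only once per m iterations, which is the intuition behind the claimed robustness and lower per-iteration overhead.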
