PowerSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization

25 Sep 2019 · Jun Liu, Beitong Zhou, Weigao Sun, Ruijuan Chen, Claire J. Tomlin, Ye Yuan

In this paper, we propose a novel technique for improving the stochastic gradient descent (SGD) method to train deep networks, which we term \emph{PowerSGD}. The proposed PowerSGD method simply raises the stochastic gradient to a power $\gamma\in[0,1]$ at each iteration and thus introduces only one additional hyper-parameter, the power exponent $\gamma$ (when $\gamma=1$, PowerSGD reduces to SGD). We further propose PowerSGD with momentum, which we term \emph{PowerSGDM}, and provide convergence rate analyses for both PowerSGD and PowerSGDM. Experiments are conducted on popular deep learning models and benchmark datasets. Empirical results show that the proposed PowerSGD and PowerSGDM achieve faster initial training than adaptive gradient methods, comparable generalization ability to SGD, and improved robustness to hyper-parameter selection and vanishing gradients. PowerSGD is essentially a gradient modifier via a nonlinear transformation. As such, it is orthogonal and complementary to other techniques for accelerating gradient-based optimization.
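
Below is a minimal sketch of the update rules as we read them from the abstract: the stochastic gradient is assumed to be transformed element-wise by a sign-preserving power $\mathrm{sign}(g)\,|g|^{\gamma}$ before a plain SGD (or heavy-ball momentum) step. The function names, the momentum coefficient `beta`, and the element-wise form of the transform are our assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def power_transform(grad, gamma):
    """Assumed element-wise sign-preserving power: sign(g) * |g|**gamma.
    With gamma = 1 this is the identity, recovering plain SGD."""
    return np.sign(grad) * np.abs(grad) ** gamma

def powersgd_step(params, grad, lr, gamma):
    """One PowerSGD step (sketch): an SGD update on the powered gradient."""
    return params - lr * power_transform(grad, gamma)

def powersgdm_step(params, grad, momentum, lr, gamma, beta=0.9):
    """One PowerSGDM step (sketch): heavy-ball momentum on the powered gradient.
    `beta` is a hypothetical momentum coefficient, not taken from the paper."""
    momentum = beta * momentum + power_transform(grad, gamma)
    return params - lr * momentum, momentum

# Toy usage on the quadratic loss f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([2.0, -1.0, 0.5])
m = np.zeros_like(w)
for _ in range(100):
    g = w  # gradient of the toy loss at the current iterate
    w, m = powersgdm_step(w, g, m, lr=0.1, gamma=0.6)
```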
