Revisiting the Stability of Stochastic Gradient Descent: A Tightness Analysis

1 Jan 2021  ·  Yikai Zhang, Samuel Bald, Wenjia Zhang, Vamsi Pritham Pingali, Chao Chen, Mayank Goswami

The technique of algorithmic stability has been used to capture the generalization power of several learning models, especially those trained with stochastic gradient descent (SGD). This paper investigates the tightness of the algorithmic stability bounds for SGD given by Hardt et al. (2016). We show that their analysis is tight for convex objective functions but loose for non-convex ones. In the non-convex case we give a tighter upper bound on the stability (and hence the generalization error), and provide evidence that it is asymptotically tight up to a constant factor. However, deep neural networks trained with SGD exhibit much better stability and generalization in practice than these (tight) bounds suggest, namely linear or exponential degradation with time for SGD with a constant step size. We therefore aim to characterize deep learning loss functions that enjoy good generalization guarantees despite being trained with constant-step-size SGD. We propose the Hessian Contractive (HC) condition, which specifies the contractivity of regions containing local minima in the neural network loss landscape. We provide empirical evidence that this condition holds for several loss functions, and theoretical evidence that the known tight SGD stability bounds for convex and non-convex loss functions can be circumvented by HC loss functions, thus partially explaining the generalization of deep neural networks.
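To make the stability notion under study concrete, below is a minimal sketch of the standard uniform-stability experiment from Hardt et al.'s framework: run SGD with an identical sampling sequence on two datasets that differ in a single example and track the parameter divergence, which drives the stability (and hence generalization) bounds. This is an illustration only, not the paper's method; the logistic loss, the synthetic data, and the names `sgd_divergence` and `logistic_grad` are our assumptions.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) at one example."""
    margin = y * (x @ w)
    return -y * x / (1.0 + np.exp(margin))

def sgd_divergence(X, y, swap_idx, lr=0.01, epochs=5, seed=0):
    """Run SGD with a shared sampling sequence on two datasets differing only
    in example `swap_idx`, recording ||w_t - w'_t|| after every step.
    This coupled parameter gap is the quantity bounded in uniform-stability
    analyses of SGD (Hardt et al., 2016)."""
    n, d = X.shape
    # Build the neighboring dataset: replace one example with a random one.
    X2, y2 = X.copy(), y.copy()
    X2[swap_idx] = np.random.default_rng(1).normal(size=d)

    rng = np.random.default_rng(seed)
    w, w2 = np.zeros(d), np.zeros(d)
    gaps = []
    for _ in range(epochs):
        for i in rng.permutation(n):  # identical sample order on both runs
            w = w - lr * logistic_grad(w, X[i], y[i])
            w2 = w2 - lr * logistic_grad(w2, X2[i], y2[i])
            gaps.append(np.linalg.norm(w - w2))
    return gaps

# Tiny usage example on synthetic linearly separable data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = np.sign(X @ rng.normal(size=10))
print(sgd_divergence(X, y, swap_idx=0)[-1])
```

Plotting `gaps` against the step count makes the growth rate of the divergence visible, which is exactly the quantity whose linear-or-exponential worst-case behavior the paper's tightness analysis concerns.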
