Stability and Convergence Theory for Learning ResNet: A Full Characterization

25 Sep 2019  ·  Huishuai Zhang, Da Yu, Mingyang Yi, Wei Chen, Tie-Yan Liu

The ResNet architecture has achieved great success since its debut. In this paper, we study the stability of learning ResNet. Specifically, we consider the ResNet block $h_l = \phi(h_{l-1}+\tau\cdot g(h_{l-1}))$, where $\phi(\cdot)$ is the ReLU activation and $\tau$ is a scalar. We show that, under the standard initialization used in practice, $\tau = 1/\Omega(\sqrt{L})$ is a sharp threshold characterizing the stability of the forward/backward processes of ResNet, where $L$ is the number of residual blocks. Specifically, stability is guaranteed for $\tau\le 1/\Omega(\sqrt{L})$, while conversely the forward process explodes when $\tau>L^{-\frac{1}{2}+c}$ for a positive constant $c$. Moreover, if ResNet is properly over-parameterized, we show that for $\tau \le 1/\tilde{\Omega}(\sqrt{L})$ gradient descent is guaranteed to find the global minima\footnote{We use $\tilde{\Omega}(\cdot)$ to hide logarithmic factors.}, which significantly enlarges the range $\tau\le 1/\tilde{\Omega}(L)$ that admits global convergence in previous work. We also demonstrate that the over-parameterization requirement of ResNet depends only weakly on the depth, which corroborates the advantage of ResNet over vanilla feedforward networks. Empirically, with $\tau\le 1/\sqrt{L}$, deep ResNet can be trained easily even without normalization layers. Moreover, adding the factor $\tau=1/\sqrt{L}$ can also improve the performance of ResNet with normalization layers. A code sketch of the scaled residual block appears below.

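Below is a minimal PyTorch-style sketch of the scaled residual block $h_l = \phi(h_{l-1}+\tau\cdot g(h_{l-1}))$ with $\tau = 1/\sqrt{L}$, the scaling the abstract identifies as the stability threshold. The residual branch $g$ is shown here as a generic two-layer MLP; the paper's exact branch architecture, widths, and initialization may differ, so treat this as an illustration of the scaling rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledResidualBlock(nn.Module):
    """One block h_l = ReLU(h_{l-1} + tau * g(h_{l-1})).

    The residual branch g is a generic two-layer MLP here
    (an assumption for illustration, not the paper's exact branch).
    """

    def __init__(self, dim: int, tau: float):
        super().__init__()
        self.tau = tau
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, h):
        g = self.fc2(F.relu(self.fc1(h)))   # residual branch g(h_{l-1})
        return F.relu(h + self.tau * g)     # h_l = phi(h_{l-1} + tau * g(h_{l-1}))


class ScaledResNet(nn.Module):
    """Stack of L residual blocks with tau = 1/sqrt(L)."""

    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        tau = 1.0 / math.sqrt(num_blocks)   # tau <= 1/sqrt(L) keeps the forward pass stable
        self.blocks = nn.ModuleList(
            ScaledResidualBlock(dim, tau) for _ in range(num_blocks)
        )

    def forward(self, h):
        for block in self.blocks:
            h = block(h)
        return h


# Usage example: a 100-block network, no normalization layers.
net = ScaledResNet(dim=64, num_blocks=100)
out = net(torch.randn(8, 64))
```

With $\tau$ set larger than $L^{-1/2+c}$ the abstract's instability result would apply, so the scaling above is the design choice this sketch is meant to highlight.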