Empirically Characterizing Overparameterization Impact on Convergence

A long-held conventional wisdom states that larger models train more slowly when using gradient descent. This work challenges this widely-held belief, showing that larger models can potentially train faster despite the increasing computational requirements of each training step. In particular, we study the effect of network structure (depth and width) on halting time and show that larger models---wider models in particular---take fewer training steps to converge. We design simple experiments to quantitatively characterize the effect of overparametrization on weight space traversal. Results show that halting time improves when growing model's width for three different applications, and the improvement comes from each factor: The distance from initialized weights to converged weights shrinks with a power-law-like relationship, the average step size grows with a power-law-like relationship, and gradient vectors become more aligned with each other during traversal.

PDF Abstract
No code implementations yet. Submit your code now



  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here