no code implementations • 7 Jun 2023 • Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
In this paper, we first present an explanation for the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD).
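A minimal sketch of how such spikes can be surfaced in practice (an illustrative setup, not the paper's experiments: the architecture, data, and learning rate below are arbitrary, and whether spikes appear depends on the width, step size, and batch size):

```python
# Toy example: train a small two-layer ReLU network with SGD on synthetic data
# and log the per-minibatch loss, so that any spikes are visible in the trace.
import torch

torch.manual_seed(0)
X = torch.randn(512, 10)
y = torch.randn(512, 1)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.5)  # deliberately large step size
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(200):
    idx = torch.randint(0, 512, (32,))   # random minibatch
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())           # spikes, if any, show up in this trace
```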
no code implementations • 29 Sep 2022 • Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin
Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss.
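For reference, restricted strong convexity in its generic form (a sketch; the restricted set $S$ and modulus $\alpha$ are generic placeholders rather than the paper's specific quantities) requires $\mathcal{L}(w') \ge \mathcal{L}(w) + \langle \nabla \mathcal{L}(w),\, w' - w\rangle + \frac{\alpha}{2}\lVert w' - w\rVert^2$ for all $w, w' \in S$; that is, strong convexity need only hold over the restricted set rather than globally.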
no code implementations • 30 Jun 2022 • Libin Zhu, Parthe Pandit, Mikhail Belkin
In this work, we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization.
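A quick numerical check of the simplest instance of this claim (a hypothetical two-block example, not the paper's general deep setting): with a width-one bottleneck, the output $W_2(W_1 x)$ of a linear network is bilinear in the weight blocks on either side of the bottleneck, i.e. linear in each block with the other held fixed.

```python
# Verify bilinearity of a linear bottleneck network in its two weight blocks.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
W1 = rng.standard_normal((1, 10))   # bottleneck layer of width 1
W2 = rng.standard_normal((5, 1))

def f(W1, W2, x):
    return W2 @ (W1 @ x)

# Linearity in each weight block separately (holding the other fixed):
A = rng.standard_normal(W1.shape)
B = rng.standard_normal(W2.shape)
assert np.allclose(f(W1 + A, W2, x), f(W1, W2, x) + f(A, W2, x))
assert np.allclose(f(W1, W2 + B, x), f(W1, W2, x) + f(W1, B, x))
```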
1 code implementation • 24 May 2022 • Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models.
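As a rough illustration of the distinction (notation assumed here, not taken from the paper): the linear approximation keeps only the first-order Taylor term, $f_{\text{lin}}(w) = f(w_0) + \nabla f(w_0)^\top (w - w_0)$, whereas a quadratic model also retains the second-order term $\frac{1}{2}(w - w_0)^\top H(w_0)\,(w - w_0)$, with $H$ the Hessian of the network output in the weights; it is this extra term that lets the quadratic model capture properties of wide networks that the linear approximation cannot.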
no code implementations • 24 May 2022 • Libin Zhu, Chaoyue Liu, Mikhail Belkin
In this paper, we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo a transition to linearity as their "width" approaches infinity.
no code implementations • ICLR 2022 • Chaoyue Liu, Libin Zhu, Mikhail Belkin
Wide neural networks with a linear output layer have been shown to be near-linear, and to have a near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent.
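For context, the neural tangent kernel referred to here is the kernel $K_w(x, x') = \langle \nabla_w f(w; x),\, \nabla_w f(w; x')\rangle$ induced by the gradients of the network output with respect to the weights; "near-constant" means this kernel changes little as $w$ moves along the optimization path.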
no code implementations • NeurIPS 2020 • Chaoyue Liu, Libin Zhu, Mikhail Belkin
We show that the transition to linearity of the model and, equivalently, the constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width.
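A sketch of why Hessian scaling implies near-linearity, assuming the stated scaling takes the form of a spectral-norm bound (log factors and constants omitted): by Taylor's theorem, $|f(w) - f(w_0) - \nabla f(w_0)^\top (w - w_0)| \le \frac{1}{2}\,\sup_{w' \in B} \lVert H(w')\rVert\,\lVert w - w_0\rVert^2$, so if the Hessian spectral norm scales as $\tilde O(1/\sqrt{m})$ in the width $m$ while the relevant ball $B$ around initialization has radius $O(1)$, the deviation from the linear model, and hence the change in the tangent kernel, vanishes as $m \to \infty$.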
no code implementations • 29 Feb 2020 • Chaoyue Liu, Libin Zhu, Mikhail Belkin
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks.