no code implementations • 7 Feb 2024 • Daniel Beaglehole, Ioannis Mitliagkas, Atish Agarwala
Prior works have identified that the Gram matrices of the weights in trained neural networks of general architectures are proportional to the average gradient outer product of the model, in a statement known as the Neural Feature Ansatz (NFA).
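A minimal sketch of how one might test the ansatz numerically, assuming a toy two-layer model with random weights standing in for trained ones (all names and shapes below are illustrative):

```python
# Compare the first-layer Gram matrix W1^T W1 against the average
# gradient outer product (AGOP) of the model over the inputs.
import jax
import jax.numpy as jnp

def model(params, x):
    W1, W2 = params
    return (W2 @ jnp.tanh(W1 @ x))[0]   # scalar output

def agop(params, xs):
    grads = jax.vmap(jax.grad(model, argnums=1), in_axes=(None, 0))(params, xs)
    return grads.T @ grads / xs.shape[0]            # (d, d) average outer product

d, m, n = 8, 16, 256
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = (jax.random.normal(k1, (m, d)) / jnp.sqrt(d),
          jax.random.normal(k2, (1, m)) / jnp.sqrt(m))
xs = jax.random.normal(k3, (n, d))

nfm = params[0].T @ params[0]                       # neural feature matrix W1^T W1
A = agop(params, xs)
# The NFA predicts nfm ∝ A for trained weights; measure alignment by cosine similarity.
print(jnp.sum(nfm * A) / (jnp.linalg.norm(nfm) * jnp.linalg.norm(A)))
```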
no code implementations • 19 Jan 2024 • Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
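For reference, a gradient penalty in this setting typically means adding the squared norm of the loss gradient to the training objective. A minimal sketch, assuming a placeholder quadratic loss and an illustrative penalty weight `lam`:

```python
import jax
import jax.numpy as jnp

def base_loss(params, batch):
    x, y = batch                       # placeholder loss; swap in the real one
    return jnp.mean((params @ x - y) ** 2)

def penalized_loss(params, batch, lam=0.1):
    # Gradient penalty: add the squared norm of the parameter gradient.
    g = jax.grad(base_loss)(params, batch)
    return base_loss(params, batch) + lam * jnp.sum(g ** 2)

params = jnp.ones(3)
batch = (jnp.eye(3), jnp.array([1.0, 2.0, 3.0]))
print(penalized_loss(params, batch))   # training differentiates this, requiring second derivatives
```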
no code implementations • 30 Nov 2023 • Vincent Roulet, Atish Agarwala, Fabian Pedregosa
Recent empirical work has revealed an intriguing property of deep learning models: for a fixed stepsize, the sharpness (the largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value, at which point the optimizer operates at the edge of stability (Cohen et al., 2022).
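A minimal sketch of how sharpness can be tracked in practice, using power iteration on Hessian-vector products; the flat parameter vector, the quartic toy loss, and the iteration count are illustrative assumptions:

```python
import jax
import jax.numpy as jnp

def sharpness(loss_fn, params, n_iter=50):
    # Hessian-vector product via forward-over-reverse differentiation.
    hvp = lambda v: jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]
    v = jax.random.normal(jax.random.PRNGKey(0), params.shape)
    for _ in range(n_iter):
        v = hvp(v)
        v = v / jnp.linalg.norm(v)
    return jnp.vdot(v, hvp(v))         # Rayleigh quotient ≈ largest eigenvalue

loss_fn = lambda p: jnp.sum(p ** 4)    # illustrative loss
print(sharpness(loss_fn, jnp.ones(5)))
# At the edge of stability with stepsize eta, this value hovers near 2 / eta.
```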
no code implementations • 17 Feb 2023 • Atish Agarwala, Yann N. Dauphin
We show that in a simplified setting, SAM dynamically induces a stabilization related to the edge of stability (EOS) phenomenon observed in large learning rate gradient descent.
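For context, a single SAM update first ascends a distance rho along the normalized gradient, then descends using the gradient taken at that perturbed point. A minimal sketch, with rho, lr, and the toy loss as illustrative choices:

```python
import jax
import jax.numpy as jnp

def sam_step(loss_fn, params, rho=0.05, lr=0.1):
    g = jax.grad(loss_fn)(params)
    eps = rho * g / (jnp.linalg.norm(g) + 1e-12)   # worst-case nearby perturbation
    return params - lr * jax.grad(loss_fn)(params + eps)

loss_fn = lambda p: jnp.sum(p ** 2)
print(sam_step(loss_fn, jnp.array([1.0, -2.0])))
```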
no code implementations • 10 Oct 2022 • Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value that allows convergence (edge of stability).
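To see where the stability threshold comes from, consider gradient descent on a one-dimensional quadratic (a standard illustration, not this paper's analysis):

$$L(\theta) = \tfrac{\lambda}{2}\,\theta^2, \qquad \theta_{t+1} = \theta_t - \eta\,L'(\theta_t) = (1 - \eta\lambda)\,\theta_t,$$

which diverges exactly when $|1 - \eta\lambda| > 1$, i.e. when the sharpness $\lambda$ exceeds $2/\eta$; the edge of stability corresponds to $\lambda \approx 2/\eta$.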
no code implementations • 19 Jul 2022 • Atish Agarwala, Samuel S. Schoenholz
Deep equilibrium networks (DEQs) are a promising way to construct models which trade off memory for compute.
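A minimal sketch of a DEQ forward pass: a single cell is iterated to a fixed point $z^* = f(z^*, x)$, so memory does not grow with effective depth. The cell, tolerance, and iteration cap are illustrative assumptions:

```python
import jax
import jax.numpy as jnp

def cell(params, z, x):
    W, U = params
    return jnp.tanh(W @ z + U @ x)

def deq_forward(params, x, dim, tol=1e-5, max_iter=100):
    z = jnp.zeros(dim)
    for _ in range(max_iter):
        z_next = cell(params, z, x)
        if jnp.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z_next

dim, d_in = 4, 3
kW, kU, kx = jax.random.split(jax.random.PRNGKey(0), 3)
params = (0.5 * jax.random.normal(kW, (dim, dim)) / jnp.sqrt(dim),  # contractive-ish scaling
          jax.random.normal(kU, (dim, d_in)))
x = jax.random.normal(kx, (d_in,))
print(deq_forward(params, x, dim))
# A full DEQ backpropagates through z* by implicit differentiation
# rather than through the iterations, which is the memory saving.
```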
no code implementations • ICLR 2021 • Atish Agarwala, Abhimanyu Das, Brendan Juba, Rina Panigrahy, Vatsal Sharan, Xin Wang, Qiuyi Zhang
Can deep learning solve multiple tasks simultaneously, even when they are unrelated and very different?
no code implementations • 14 Oct 2020 • Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz
In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse temperature $\beta$ as well as the magnitude of the logits at initialization, $\|\beta\mathbf{z}\|_2$.
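A minimal sketch of the temperature-scaled loss in question, where the logits $\mathbf{z}$ are multiplied by $\beta$ before the softmax; the toy logits and beta value are illustrative:

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def softmax_xent(z, y, beta=1.0):
    logits = beta * z
    return logsumexp(logits) - logits[y]   # -log softmax(beta z)[y], written stably

z = jnp.array([0.3, -0.1, 0.5])            # logits at initialization
print(softmax_xent(z, y=2, beta=4.0))
print(jnp.linalg.norm(4.0 * z))            # the magnitude ||beta z||_2 from the abstract
```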
no code implementations • 15 May 2020 • Atish Agarwala, Abhimanyu Das, Rina Panigrahy, Qiuyi Zhang
We present experimental evidence that the many-body gravitational force function is easier to learn with ReLU networks as compared to networks with exponential activations.
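For concreteness, the learning target is the net inverse-square force on each body from all others; a minimal sketch with $G = 1$ and an illustrative three-body configuration:

```python
import jax.numpy as jnp

def gravitational_forces(pos, mass, G=1.0):
    n = pos.shape[0]
    diff = pos[None, :, :] - pos[:, None, :]             # r_j - r_i, shape (n, n, 3)
    dist = jnp.linalg.norm(diff, axis=-1) + jnp.eye(n)   # pad diagonal to avoid 0 division
    mag = G * mass[:, None] * mass[None, :] / dist ** 3
    mag = mag * (1.0 - jnp.eye(n))                       # no self-interaction
    return jnp.sum(mag[:, :, None] * diff, axis=1)       # (n, 3) net forces

pos = jnp.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(gravitational_forces(pos, jnp.array([1.0, 2.0, 3.0])))
```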