no code implementations • 23 Nov 2021 • Darshil Doshi, Tianyu He, Andrey Gromov
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
no code implementations • 27 Jun 2022 • Tianyu He, Darshil Doshi, Andrey Gromov
Good initialization is essential for training Deep Neural Networks (DNNs).
1 code implementation • 6 Jan 2023 • Andrey Gromov
We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as ``grokking''.
1 code implementation • 19 Oct 2023 • Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large.
no code implementations • 15 Feb 2024 • Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Andrey Gromov, Sanmi Koyejo
Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log likelihoods, we build a bridge between the two that enables useful flow of ideas in both directions.
1 code implementation • 26 Mar 2024 • Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.
no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?