no code implementations • 19 Mar 2025 • Lisa Jin, Jianhao Ma, Zechun Liu, Andrey Gromov, Aaron Defazio, Lin Xiao
We develop a principled method for quantization-aware training (QAT) of large-scale machine learning models.
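The snippet does not describe the method itself; for orientation, here is a minimal, generic QAT sketch in PyTorch using fake quantization with a straight-through estimator. The class names, bit width, and quantizer are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    """Uniform symmetric fake-quantization with a straight-through estimator (STE)."""
    @staticmethod
    def forward(ctx, w, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-12                     # per-tensor scale
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                                    # STE: gradient passes straight through

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized on the forward pass."""
    def forward(self, x):
        return F.linear(x, FakeQuant.apply(self.weight, 4), self.bias)

# Gradients still flow to the full-precision master weights.
layer = QATLinear(8, 4)
layer(torch.randn(2, 8)).sum().backward()
```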
no code implementations • 12 Feb 2025 • Andrew Cohen, Andrey Gromov, Kaiyu Yang, Yuandong Tian
In this setting, the representations and the dynamics learned by the model are interpretable.
no code implementations • 13 Jun 2024 • Rylan Schaeffer, Victor Lecomte, Dhruv Bhandarkar Pai, Andres Carranza, Berivan Isik, Alyssa Unell, Mikail Khona, Thomas Yerxa, Yann LeCun, SueYeon Chung, Andrey Gromov, Ravid Shwartz-Ziv, Sanmi Koyejo
We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL.
no code implementations • 5 Jun 2024 • Darshil Doshi, Tianyu He, Aritra Das, Andrey Gromov
Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest.
1 code implementation • 4 Jun 2024 • Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov
In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks.
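As an illustrative sketch of how such a collection of tasks might be set up (the task family and prompt format below are assumptions, not taken from the paper): each task is a modular linear map f_{a,b}(x, y) = (a*x + b*y) mod p, and an in-context prompt is a sequence of worked examples followed by a query.

```python
import random

def make_icl_sequence(a, b, p=29, n_examples=8, seed=0):
    """Build one in-context sequence for the modular task
    f(x, y) = (a*x + b*y) mod p, as a flat list of (x, y, answer) triples.
    The final answer is withheld as the prediction target."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        seq.extend([x, y, (a * x + b * y) % p])
    return seq[:-1], seq[-1]          # (prompt tokens, target token)

# A "collection" of tasks is just different (a, b) pairs sharing the same format.
tasks = [(a, b) for a in range(1, 5) for b in range(1, 5)]
prompt, target = make_icl_sequence(*tasks[0])
```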
no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?
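The snippet only poses the question; a toy illustration of the underlying concern (not from the paper) is a model repeatedly refit on its own samples, here a Gaussian whose parameters are re-estimated each generation.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=1_000)        # "real" data for generation 0

for generation in range(10):
    mu, sigma = samples.mean(), samples.std()      # "train" a model on the current data
    samples = rng.normal(mu, sigma, size=1_000)    # the next generation sees only model output
    print(f"gen {generation}: sigma = {sigma:.3f}")
```

Here each generation discards the previous data entirely (replacement); accumulating data across generations is a natural variant of the same experiment.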
3 code implementations • 26 Mar 2024 • Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.
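A minimal sketch of this kind of layer pruning, assuming a Llama-style Hugging Face checkpoint whose decoder blocks live in model.model.layers; the block indices below are arbitrary, and the paper's layer-selection criterion is not reproduced here.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Drop a contiguous block of decoder layers (indices chosen for illustration only).
drop = set(range(20, 28))
kept = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in drop
)
model.model.layers = kept
model.config.num_hidden_layers = len(kept)
# Depending on the transformers version, per-layer cache indices
# (e.g. layer.self_attn.layer_idx) may also need renumbering.
# The pruned model can then be evaluated directly or lightly fine-tuned to "heal" it.
```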
no code implementations • 15 Feb 2024 • Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Andrey Gromov, Sanmi Koyejo
Based on the observation that the energy functions of associative memory can be viewed as the negative log-likelihoods of probabilistic models, we build a bridge between the two that enables a useful flow of ideas in both directions.
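Concretely, the identification in question is the standard correspondence between energies and unnormalized negative log-probabilities; paraphrasing the textbook statement rather than the paper:

```latex
p(x) \;=\; \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x} e^{-E(x)}
\quad\Longleftrightarrow\quad
E(x) \;=\; -\log p(x) \;-\; \log Z .
```

Under this identification, energy descent in the associative-memory picture corresponds to log-likelihood ascent in the probabilistic picture.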
1 code implementation • 19 Oct 2023 • Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large.
1 code implementation • 6 Jan 2023 • Andrey Gromov
We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as "grokking".
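A rough sketch of this kind of setup (the width, optimizer, split, and hyperparameters below are illustrative assumptions, not the paper's exact configuration): a small MLP is trained with weight decay on one-hot encoded pairs (a, b) to predict (a + b) mod p, and test accuracy is monitored for a delayed jump.

```python
import torch
import torch.nn as nn

p = 97
# All (a, b) pairs for addition mod p, one-hot encoded as a 2p-dimensional input.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
X = torch.zeros(len(pairs), 2 * p)
X[torch.arange(len(pairs)), pairs[:, 0]] = 1.0
X[torch.arange(len(pairs)), p + pairs[:, 1]] = 1.0
y = (pairs[:, 0] + pairs[:, 1]) % p

# Train on half the pairs; grokking shows up as a late jump in test accuracy.
perm = torch.randperm(len(pairs))
train, test = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(10_000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X[train]), y[train])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            acc = (model(X[test]).argmax(-1) == y[test]).float().mean()
        print(f"step {step}: train loss {loss.item():.4f}, test acc {acc:.3f}")
```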
no code implementations • 27 Jun 2022 • Tianyu He, Darshil Doshi, Andrey Gromov
Good initialization is essential for training Deep Neural Networks (DNNs).
1 code implementation • 23 Nov 2021 • Darshil Doshi, Tianyu He, Andrey Gromov
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
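An empirical counterpart of that quantity, assuming the partial Jacobian means the Jacobian of a deeper layer's activations with respect to a shallower layer's activations (the architecture, depth, and activation below are illustrative):

```python
import torch
import torch.nn as nn

depth, width = 10, 128
net = nn.Sequential(*[nn.Sequential(nn.Linear(width, width), nn.Tanh())
                      for _ in range(depth)])

def partial_jacobian_norm(net, x, l0, l):
    """Frobenius norm of d h^l / d h^{l0} for a single input x."""
    def head(h):                       # maps the layer-l0 output to the layer-l output
        for block in list(net)[l0:l]:
            h = block(h)
        return h
    with torch.no_grad():
        h0 = net[:l0](x)               # forward pass up to layer l0
    J = torch.autograd.functional.jacobian(head, h0)
    return J.norm()

x = torch.randn(width)
print(partial_jacobian_norm(net, x, l0=2, l=8))
```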