no code implementations • 14 Feb 2025 • Tao Tao, Darshil Doshi, Dayal Singh Kalra, Tianyu He, Maissam Barkeshli
Our analysis reveals that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a, c$).
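For concreteness, here is a minimal sketch of the data such a task involves: the LCG recurrence $x_{n+1} = (a\,x_n + c) \bmod m$, unrolled into a context whose next term the model must predict in-context. The function names and the context length are illustrative, not taken from the paper's code.

```python
# Minimal sketch of an LCG in-context prediction instance:
# x_{n+1} = (a * x_n + c) mod m, unrolled into a context whose next term
# the Transformer must predict. Names and context length are illustrative.

def lcg_sequence(a: int, c: int, m: int, x0: int, length: int) -> list[int]:
    """Generate `length` terms of the LCG with multiplier a, increment c, modulus m."""
    seq, x = [], x0 % m
    for _ in range(length):
        seq.append(x)
        x = (a * x + c) % m
    return seq

def make_prompt(a: int, c: int, m: int, x0: int, context_len: int = 16):
    """Return (context, target): the model sees `context` and predicts `target`."""
    seq = lcg_sequence(a, c, m, x0, context_len + 1)
    return seq[:-1], seq[-1]

if __name__ == "__main__":
    # An (a, c, m) triple held out of training probes in-context generalization
    # to unseen moduli and parameters.
    context, target = make_prompt(a=5, c=3, m=97, x0=11)
    print(context, "->", target)
```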
no code implementations • 5 Jun 2024 • Darshil Doshi, Tianyu He, Aritra Das, Andrey Gromov
Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest.
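As an illustration of what "modular arithmetic tasks" means here, a hedged sketch of one way to materialize such a task: a full table of input pairs labeled by a polynomial evaluated modulo a prime. The specific polynomials and modulus below are illustrative; which ones networks actually learn versus fail on is the subject of the paper.

```python
# Sketch of a modular arithmetic task as a lookup table over a prime modulus p.
# The choice of polynomial f determines the task; the examples below are
# illustrative instances, not the paper's exact task split.

import itertools

def modular_task_table(f, p: int):
    """Enumerate all (x, y, f(x, y) mod p) triples for modulus p."""
    return [(x, y, f(x, y) % p) for x, y in itertools.product(range(p), repeat=2)]

p = 97
addition  = modular_task_table(lambda x, y: x + y, p)          # e.g. modular addition
quadratic = modular_task_table(lambda x, y: x**2 + y**2, p)    # a low-degree polynomial
cubic_mix = modular_task_table(lambda x, y: x**3 + x * y, p)   # a harder polynomial

print(len(addition), addition[:3])
```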
1 code implementation • 4 Jun 2024 • Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov
In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks.
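A hedged sketch of what an in-context episode for such a task collection could look like: each episode fixes a hidden task, places a few solved examples in the prompt, and asks the model to answer a held-out query, so the task must be inferred from context. The task family $z = (a\,x + b\,y) \bmod p$ and the episode sizes below are assumptions made for illustration.

```python
# Sketch of an in-context episode over a family of modular tasks.
# The hidden task (a, b) is never given explicitly; it must be inferred from
# the in-context examples. Task family and episode size are illustrative.

import random

def make_episode(p: int, n_examples: int = 8, rng=random):
    a, b = rng.randrange(p), rng.randrange(p)               # hidden task parameters
    pairs = [(rng.randrange(p), rng.randrange(p)) for _ in range(n_examples + 1)]
    examples = [(x, y, (a * x + b * y) % p) for x, y in pairs[:-1]]
    qx, qy = pairs[-1]
    return examples, (qx, qy), (a * qx + b * qy) % p         # context, query, answer

context, query, answer = make_episode(p=29)
print(context, query, "->", answer)
```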
1 code implementation • 19 Oct 2023 • Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large.
no code implementations • 27 Jun 2022 • Tianyu He, Darshil Doshi, Andrey Gromov
Good initialization is essential for training Deep Neural Networks (DNNs).
1 code implementation • 23 Nov 2021 • Darshil Doshi, Tianyu He, Andrey Gromov
We derive recurrence relations for the norms of partial Jacobians and use these relations to analyze the criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
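As a hedged numerical companion to this statement, the sketch below estimates the averaged squared Frobenius norm of a partial Jacobian $\partial h^{(l)} / \partial h^{(l_0)}$ for a plain tanh MLP (no LayerNorm or residual connections); the widths, depth, and weight-variance scale are illustrative choices, not the paper's setup.

```python
# Numerical estimate of an averaged partial Jacobian norm between two hidden
# layers of a deep fully connected tanh network. Widths, depth, and the
# weight-variance scale sigma_w are illustrative.

import torch

width, depth = 256, 10
sigma_w = 1.5  # weight scale; criticality depends on this choice
layers = [torch.randn(width, width) * (sigma_w / width**0.5) for _ in range(depth)]

def forward_from(h, l0, l):
    """Propagate preactivations from layer l0 to layer l."""
    for W in layers[l0:l]:
        h = W @ torch.tanh(h)
    return h

h0 = torch.randn(width)
l0, l = 2, 8
J = torch.autograd.functional.jacobian(lambda h: forward_from(h, l0, l), h0)
# Averaged squared Frobenius norm; its growth or decay with depth (l - l0)
# diagnoses ordered, chaotic, or critical behavior at this initialization.
print((J**2).sum().item() / width)
```

Sweeping the weight scale and watching whether this norm grows or decays with depth $l - l_0$ is the numerical counterpart of the criticality analysis described above.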