no code implementations • 27 Jul 2023 • Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise.
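For background, the classical heavy-ball form of momentum (a standard formulation; the paper's precise variant and assumptions may differ) updates

$$v_{t+1} = \beta v_t + \nabla f(x_t), \qquad x_{t+1} = x_t - \eta\, v_{t+1},$$

where $\beta \in [0,1)$ is the momentum coefficient and $\eta$ the learning rate; with noiseless gradients and a suitable choice of $\beta$, such methods converge faster on strongly convex problems than plain gradient descent.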
1 code implementation • 3 Jul 2023 • Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora
In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models).
1 code implementation • 11 Oct 2022 • Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings.
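As a point of reference for the fine-tuning setup this abstract refers to, below is a minimal, generic sketch using the Hugging Face Trainer; the model, dataset, subsample size, and hyperparameters are illustrative placeholders and not this paper's method or experimental setup.

```python
# Generic low-data fine-tuning sketch (assumes transformers and datasets are installed).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad/truncate so the default collator can batch examples directly.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)
# Subsample the training split to mimic a low-data (few-shot) setting.
train_subset = dataset["train"].shuffle(seed=0).select(range(64))

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
Trainer(model=model, args=args, train_dataset=train_subset).train()
```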
1 code implementation • 20 May 2022 • Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD.
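For reference, the commonly used first-order SDE approximation of SGD with learning rate $\eta$ (stated here as standard background for this line of work, not as this paper's specific contribution) is

$$dX_t = -\nabla L(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t,$$

where $L$ is the loss, $\Sigma(X_t)$ is the covariance of the stochastic gradient noise, and $W_t$ is standard Brownian motion; the $\sqrt{\eta}$ scaling of the diffusion term is what preserves SGD's stochasticity in the continuous limit.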
1 code implementation • NeurIPS 2021 • Zhiyuan Li, Sadhika Malladi, Sanjeev Arora
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets.
no code implementations • ICLR 2021 • Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora
This paper initiates a mathematical study of this phenomenon (the empirical success of language models pretrained on next-word prediction at downstream tasks) for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification?