no code implementations • 22 Feb 2024 • Eshaan Nichani, Alex Damian, Jason D. Lee
The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens.
no code implementations • 23 Nov 2023 • ZiHao Wang, Eshaan Nichani, Jason D. Lee
Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time.
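The layerwise training scheme referenced in this result can be illustrated with a minimal sketch (not the paper's construction or proof): a small three-layer network on synthetic data, where gradient descent optimizes one layer at a time, earlier layers first. The numerical gradients, architecture, target, and hyperparameters below are all assumptions chosen for illustration.

```python
import numpy as np

def num_grad(f, w, eps=1e-5):
    """Central-difference gradient of scalar f at w (illustration only)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros(w.shape)
        e.flat[i] = eps
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 2))
y = X[:, 0] * X[:, 1]                      # a simple degree-2 target

W1 = 0.5 * rng.standard_normal((2, 8))    # three weight matrices
W2 = 0.5 * rng.standard_normal((8, 8))
W3 = 0.5 * rng.standard_normal(8)

def loss(W1, W2, W3):
    pred = np.tanh(np.tanh(X @ W1) @ W2) @ W3
    return float(np.mean((pred - y) ** 2))

# Layerwise gradient descent: update one layer at a time, in order,
# keeping the other layers frozen during each stage.
losses = [loss(W1, W2, W3)]
for stage in range(3):
    for _ in range(30):
        if stage == 0:
            W1 -= 0.05 * num_grad(lambda w: loss(w, W2, W3), W1)
        elif stage == 1:
            W2 -= 0.05 * num_grad(lambda w: loss(W1, w, W3), W2)
        else:
            W3 -= 0.05 * num_grad(lambda w: loss(W1, W2, w), W3)
    losses.append(loss(W1, W2, W3))
```

The final stage is a convex problem in the last layer's weights, which is one reason layerwise schedules are convenient to analyze.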
2 code implementations • NeurIPS 2023 • Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
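The memory bottleneck motivates gradient estimates that need only forward passes. Below is a minimal SPSA-style two-point sketch of that general idea (not the paper's exact procedure); the function name and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np

def zeroth_order_grad(loss_fn, theta, eps=1e-3, rng=None):
    """SPSA-style two-point gradient estimate: needs only two forward
    passes, so no activations are stored for backpropagation."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(theta.shape)             # random direction
    g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g * z                                     # rank-one estimate

# Toy quadratic loss: the true gradient at theta is 2 * theta, and the
# average of many single-direction estimates converges toward it.
theta = np.array([1.0, -2.0, 0.5])
loss = lambda t: float(t @ t)
rng = np.random.default_rng(0)
est = np.mean([zeroth_order_grad(loss, theta, rng=rng)
               for _ in range(2000)], axis=0)
```

Each estimate is unbiased in expectation but noisy, which is why zeroth-order methods typically trade extra optimization steps for the memory savings.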
1 code implementation • 30 Sep 2022 • Alex Damian, Eshaan Nichani, Jason D. Lee
Our analysis provides precise predictions for the loss, sharpness, and deviation from the projected gradient descent (PGD) trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.
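Sharpness here means the largest eigenvalue of the training-loss Hessian. A generic way to measure it numerically (a sketch of the standard estimator, not the paper's method) is power iteration on Hessian-vector products, shown below on a toy quadratic whose Hessian is known exactly.

```python
import numpy as np

def sharpness(grad_fn, theta, iters=100, eps=1e-4):
    """Estimate the top Hessian eigenvalue ("sharpness") by power
    iteration on finite-difference Hessian-vector products."""
    v = np.random.default_rng(0).standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        v = hv / np.linalg.norm(hv)
    hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
    return float(v @ hv)   # Rayleigh quotient at the converged direction

# Quadratic loss L(theta) = 0.5 * theta^T A theta has Hessian A,
# so the sharpness should recover A's top eigenvalue, 3.0.
A = np.diag([3.0, 1.0, 0.5])
grad = lambda t: A @ t
lam = sharpness(grad, np.ones(3))
```

In deep learning frameworks the Hessian-vector product would come from automatic differentiation rather than finite differences, but the power-iteration loop is the same.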
1 code implementation • 8 Jun 2022 • Eshaan Nichani, Yu Bai, Jason D. Lee
Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on its own.
no code implementations • 19 Oct 2020 • Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler
We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK.
no code implementations • 28 Sep 2020 • Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler
Recent work explained the strong generalization of overparameterized models by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold leads to a decrease in test error.
no code implementations • 13 Mar 2020 • Adityanarayanan Radhakrishnan, Eshaan Nichani, Daniel Bernstein, Caroline Uhler
We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky (2018).