Search Results for author: Eshaan Nichani

Found 8 papers, 3 papers with code

How Transformers Learn Causal Structure with Gradient Descent

no code implementations · 22 Feb 2024 · Eshaan Nichani, Alex Damian, Jason D. Lee

The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens.
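
For reference, the pairwise mutual information alluded to here is the standard quantity (the notation $x_i$, $x_j$ for tokens at positions $i$ and $j$ is ours, not necessarily the paper's):

$$I(x_i; x_j) = \sum_{a,\,b} \mathbb{P}(x_i = a,\, x_j = b)\, \log \frac{\mathbb{P}(x_i = a,\, x_j = b)}{\mathbb{P}(x_i = a)\, \mathbb{P}(x_j = b)}.$$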

In-Context Learning

Learning Hierarchical Polynomials with Three-Layer Neural Networks

no code implementations · 23 Nov 2023 · ZiHao Wang, Eshaan Nichani, Jason D. Lee

Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time.
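
The "hierarchical polynomial" target here is, roughly, a composition $h = g \circ p$ of an outer link function with the degree-$k$ polynomial $p$. As a loose illustration of layerwise gradient descent on the square loss (not the authors' exact procedure; the widths, step counts, and toy target below are arbitrary placeholders):

```python
# Loose sketch of layerwise gradient descent on the square loss for a
# three-layer network: each stage updates one layer with the others frozen.
# Widths, learning rate, and the toy target are placeholders.
import torch
import torch.nn as nn

d, width, n = 32, 128, 1024
X = torch.randn(n, d)
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2          # stand-in low-degree target

net = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                    nn.Linear(width, width), nn.ReLU(),
                    nn.Linear(width, 1))
stages = [net[0], net[2], net[4]]              # the three trainable layers

loss_fn = nn.MSELoss()                         # square loss
for i, layer in enumerate(stages):
    for p in net.parameters():                 # freeze everything ...
        p.requires_grad_(False)
    for p in layer.parameters():               # ... except the current layer
        p.requires_grad_(True)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        loss = loss_fn(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    print(f"stage {i}: train loss {loss.item():.4f}")
```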

Fine-Tuning Language Models with Just Forward Passes

2 code implementations · NeurIPS 2023 · Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
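
The "just forward passes" in the title refers to estimating gradients without backpropagation. A generic two-point zeroth-order step of that flavor (an illustrative sketch, not the authors' algorithm; `loss_fn` is any closure that runs a forward pass and returns the scalar loss):

```python
# Generic two-point (SPSA-style) zeroth-order update using only forward
# passes; illustrative only, not the paper's exact method.
import torch

def zo_step(params, loss_fn, lr=1e-4, eps=1e-3):
    zs = [torch.randn_like(p) for p in params]   # one perturbation per tensor
    with torch.no_grad():
        for p, z in zip(params, zs):              # theta + eps * z
            p.add_(eps * z)
        loss_plus = float(loss_fn())
        for p, z in zip(params, zs):              # theta - eps * z
            p.sub_(2 * eps * z)
        loss_minus = float(loss_fn())
        for p, z in zip(params, zs):              # restore theta
            p.add_(eps * z)
        g = (loss_plus - loss_minus) / (2 * eps)  # directional loss difference
        for p, z in zip(params, zs):              # SGD-style update along z
            p.sub_(lr * g * z)
    return loss_plus
```

Because each update needs only two forward evaluations and no stored activations, memory usage stays close to that of inference, which is exactly the backpropagation memory cost the excerpt above points to.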

In-Context Learning · Multiple-choice

Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

1 code implementation · 30 Sep 2022 · Alex Damian, Eshaan Nichani, Jason D. Lee

Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.
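
"Sharpness" in this line of work usually means the largest eigenvalue of the training-loss Hessian. A hedged sketch of measuring it via power iteration on Hessian-vector products (generic PyTorch, not the authors' code):

```python
# Estimate sharpness (top Hessian eigenvalue of the loss) via power iteration
# on Hessian-vector products; an illustrative measurement, not the paper's code.
import torch

def sharpness(loss, params, iters=20):
    # `loss` must be the scalar training loss freshly computed from `params`.
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))   # <grad, v>
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # H v
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    hv = torch.autograd.grad(gv, params, retain_graph=True)
    return sum((h * vi).sum() for h, vi in zip(hv, v)).item()  # Rayleigh quotient
```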

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

1 code implementation · 8 Jun 2022 · Eshaan Nichani, Yu Bai, Jason D. Lee

Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own.

Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks

no code implementations · 19 Oct 2020 · Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler

We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK.

Image Classification · Open-Ended Question Answering · +1

Do Deeper Convolutional Networks Perform Better?

no code implementations · 28 Sep 2020 · Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler

Recent work provided an explanation for this phenomenon by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold leads to a decrease in test error.
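
The double-descent shape is easy to reproduce in a toy random-features regression, where test error peaks as the number of features approaches the number of training samples (the interpolation threshold) and falls again beyond it. A minimal sketch with arbitrary sizes, unrelated to the paper's CNN experiments:

```python
# Toy double-descent curve with random-features (minimum-norm) regression.
# Problem sizes and the teacher function are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 100, 1000
X_tr, X_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
w_star = rng.standard_normal(d)
y_tr, y_te = X_tr @ w_star, X_te @ w_star

for n_feat in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.standard_normal((d, n_feat)) / np.sqrt(d)     # random ReLU features
    Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    coef = np.linalg.pinv(Phi_tr) @ y_tr                  # min-norm least squares
    test_mse = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"features={n_feat:5d}  test MSE={test_mse:8.3f}")
```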

Learning Theory

On Alignment in Deep Linear Neural Networks

no code implementations · 13 Mar 2020 · Adityanarayanan Radhakrishnan, Eshaan Nichani, Daniel Bernstein, Caroline Uhler

We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky, 2018.
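
As a rough illustration only: one common way to probe this kind of alignment is to train a deep linear network with multidimensional outputs and check how closely adjacent layers' top singular vectors line up. The proxy below is ours and need not match the paper's exact definition.

```python
# Probe alignment in a deep linear network: cosine similarity between the top
# left singular vector of layer i and the top right singular vector of layer
# i+1. Illustrative proxy only; the paper's definition may differ in detail.
import torch
import torch.nn as nn

d, k, n = 20, 5, 512
X = torch.randn(n, d)
Y = X @ torch.randn(d, k)                      # linear teacher, k-dim outputs

net = nn.Sequential(nn.Linear(d, d, bias=False),
                    nn.Linear(d, d, bias=False),
                    nn.Linear(d, k, bias=False))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()

Ws = [m.weight.detach() for m in net]          # each weight maps input -> output
for i in range(len(Ws) - 1):
    Ui, _, _ = torch.linalg.svd(Ws[i])         # top left singular vector of layer i
    _, _, Vh = torch.linalg.svd(Ws[i + 1])     # top right singular vector of layer i+1
    cos = torch.abs(Ui[:, 0] @ Vh[0])
    print(f"layers {i}->{i+1}: |cos| = {cos.item():.3f}")
```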
