no code implementations • 22 Feb 2024 • Eshaan Nichani, Alex Damian, Jason D. Lee
The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens.
no code implementations • 23 Nov 2023 • ZiHao Wang, Eshaan Nichani, Jason D. Lee
Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time.
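The layerwise training scheme referenced in this result can be illustrated with a minimal sketch (not the paper's construction or proof): a small three-layer network on synthetic data, where gradient descent optimizes one layer at a time, earlier layers first. The numerical gradients, architecture, target, and hyperparameters below are all assumptions chosen for illustration.

```python
import numpy as np

def num_grad(f, w, eps=1e-5):
    """Central-difference gradient of scalar f at w (illustration only)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros(w.shape)
        e.flat[i] = eps
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 2))
y = X[:, 0] * X[:, 1]                      # a simple degree-2 target

W1 = 0.5 * rng.standard_normal((2, 8))    # three weight matrices
W2 = 0.5 * rng.standard_normal((8, 8))
W3 = 0.5 * rng.standard_normal(8)

def loss(W1, W2, W3):
    pred = np.tanh(np.tanh(X @ W1) @ W2) @ W3
    return float(np.mean((pred - y) ** 2))

# Layerwise gradient descent: update one layer at a time, in order,
# keeping the other layers frozen during each stage.
losses = [loss(W1, W2, W3)]
for stage in range(3):
    for _ in range(30):
        if stage == 0:
            W1 -= 0.05 * num_grad(lambda w: loss(w, W2, W3), W1)
        elif stage == 1:
            W2 -= 0.05 * num_grad(lambda w: loss(W1, w, W3), W2)
        else:
            W3 -= 0.05 * num_grad(lambda w: loss(W1, W2, w), W3)
    losses.append(loss(W1, W2, W3))
```

The final stage is a convex problem in the last layer's weights, which is one reason layerwise schedules are convenient to analyze.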
2 code implementations • NeurIPS 2023 • Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
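The memory bottleneck motivates gradient estimates that need only forward passes. Below is a minimal SPSA-style two-point sketch of that general idea (not the paper's exact procedure); the function name and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np

def zeroth_order_grad(loss_fn, theta, eps=1e-3, rng=None):
    """SPSA-style two-point gradient estimate: needs only two forward
    passes, so no activations are stored for backpropagation."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(theta.shape)             # random direction
    g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g * z                                     # rank-one estimate

# Toy quadratic loss: the true gradient at theta is 2 * theta, and the
# average of many single-direction estimates converges toward it.
theta = np.array([1.0, -2.0, 0.5])
loss = lambda t: float(t @ t)
rng = np.random.default_rng(0)
est = np.mean([zeroth_order_grad(loss, theta, rng=rng)
               for _ in range(2000)], axis=0)
```

Each estimate is unbiased in expectation but noisy, which is why zeroth-order methods typically trade extra optimization steps for the memory savings.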
1 code implementation • 30 Sep 2022 • Alex Damian, Eshaan Nichani, Jason D. Lee
Our analysis provides precise predictions for the loss, sharpness, and deviation from the projected gradient descent (PGD) trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.
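Sharpness here means the largest eigenvalue of the training-loss Hessian. A generic way to measure it numerically (a sketch of the standard estimator, not the paper's method) is power iteration on Hessian-vector products, shown below on a toy quadratic whose Hessian is known exactly.

```python
import numpy as np

def sharpness(grad_fn, theta, iters=100, eps=1e-4):
    """Estimate the top Hessian eigenvalue ("sharpness") by power
    iteration on finite-difference Hessian-vector products."""
    v = np.random.default_rng(0).standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        v = hv / np.linalg.norm(hv)
    hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
    return float(v @ hv)   # Rayleigh quotient at the converged direction

# Quadratic loss L(theta) = 0.5 * theta^T A theta has Hessian A,
# so the sharpness should recover A's top eigenvalue, 3.0.
A = np.diag([3.0, 1.0, 0.5])
grad = lambda t: A @ t
lam = sharpness(grad, np.ones(3))
```

In deep learning frameworks the Hessian-vector product would come from automatic differentiation rather than finite differences, but the power-iteration loop is the same.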
1 code implementation • 8 Jun 2022 • Eshaan Nichani, Yu Bai, Jason D. Lee
Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on its own.
no code implementations • 19 Oct 2020 • Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler
We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK.
no code implementations • 28 Sep 2020 • Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler
Recent work explained the strong generalization of overparameterized models by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold leads to a decrease in test error.
no code implementations • 13 Mar 2020 • Adityanarayanan Radhakrishnan, Eshaan Nichani, Daniel Bernstein, Caroline Uhler
We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky (2018).