Search Results for author: Vaishnavh Nagarajan

Found 18 papers, 5 papers with code

The pitfalls of next-token prediction

1 code implementation · 11 Mar 2024 · Gregor Bachmann, Vaishnavh Nagarajan

As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly.
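The distinction above is easy to see in code. Below is a minimal sketch (not taken from the paper's released implementation) assuming a causal language model `lm` that maps a token prefix of shape (batch, T) to next-token logits of shape (batch, T, vocab); `lm` and `tokens` are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def teacher_forced_loss(lm, tokens):
        # Training: the ground-truth prefix tokens[:, :t] is fed at every step,
        # regardless of what the model would have predicted on its own.
        logits = lm(tokens[:, :-1])              # (batch, T-1, vocab)
        targets = tokens[:, 1:]                  # next-token targets
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

    @torch.no_grad()
    def autoregressive_generate(lm, prefix, num_new_tokens):
        # Inference: each new token is conditioned on the model's *own* previous
        # outputs, so early mistakes can compound.
        tokens = prefix.clone()
        for _ in range(num_new_tokens):
            next_logits = lm(tokens)[:, -1, :]
            next_token = next_logits.argmax(dim=-1, keepdim=True)  # greedy decoding
            tokens = torch.cat([tokens, next_token], dim=1)
        return tokens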

What do larger image classifiers memorise?

no code implementations · 9 Oct 2023 · Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels.

Image Classification · Knowledge Distillation · +2
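For concreteness, the random-label memorisation phenomenon the snippet refers to can be reproduced with a sketch like the following; the architecture, data sizes and optimiser settings are illustrative assumptions, not the paper's setup.

    import torch
    import torch.nn as nn

    n, d, num_classes = 1000, 256, 10
    x = torch.randn(n, d)
    y = torch.randint(0, num_classes, (n,))      # completely random labels

    # Overparameterised MLP: far more parameters than training examples.
    model = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, num_classes))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    train_acc = (model(x).argmax(dim=1) == y).float().mean()
    print(f"train accuracy on random labels: {train_acc:.3f}")  # typically close to 1.0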

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

no code implementations · 7 Oct 2023 · Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference.

In-Context Learning
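A minimal sketch of the weight-pruning side of the comparison above, using global magnitude pruning on a single linear layer; the sparsity level and layer shape are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    def magnitude_prune_(module: nn.Linear, sparsity: float) -> None:
        """Zero out the `sparsity` fraction of weights with smallest magnitude."""
        with torch.no_grad():
            flat = module.weight.abs().flatten()
            k = int(sparsity * flat.numel())
            if k == 0:
                return
            threshold = flat.kthvalue(k).values
            mask = (module.weight.abs() > threshold)
            module.weight.mul_(mask.to(module.weight.dtype))

    layer = nn.Linear(4096, 4096)
    magnitude_prune_(layer, sparsity=0.7)
    print((layer.weight == 0).float().mean())    # roughly 0.7 of the weights removed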

Think before you speak: Training Language Models With Pause Tokens

no code implementations · 3 Oct 2023 · Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token.

GSM8K · Question Answering
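A rough sketch of the pause-token idea behind the paper: appending M extra pause tokens gives the model K + M hidden vectors per layer to manipulate before it must commit to the next answer token. The `lm` interface and `pause_token_id` below are hypothetical placeholders, not the paper's code.

    import torch

    @torch.no_grad()
    def generate_with_pauses(lm, prefix, pause_token_id, num_pauses, num_new_tokens):
        pauses = torch.full((prefix.size(0), num_pauses), pause_token_id,
                            dtype=prefix.dtype, device=prefix.device)
        tokens = torch.cat([prefix, pauses], dim=1)   # extra positions = extra compute
        for _ in range(num_new_tokens):
            next_token = lm(tokens)[:, -1, :].argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=1)
        # Outputs at pause positions are ignored; only tokens generated after the
        # pause prefix are returned as the answer.
        return tokens[:, prefix.size(1) + num_pauses:]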

On student-teacher deviations in distillation: does it pay to disobey?

no code implementations · NeurIPS 2023 · Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network.

Knowledge Distillation
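The training signal described above is the standard soft-label distillation objective, sketched below; the temperature and mixing weight are illustrative assumptions, not values from the paper.

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Student is trained to match the teacher's temperature-scaled soft
        # probabilities, mixed with the usual hard-label cross-entropy.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
        hard = F.cross_entropy(student_logits, labels)
        return alpha * distill + (1 - alpha) * hard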

Explaining generalization in deep learning: progress and fundamental limits

no code implementations · 17 Oct 2021 · Vaishnavh Nagarajan

This dissertation studies a fundamental open challenge in deep learning theory: why do deep networks generalize well even while being overparameterized, unregularized and fitting the training data to zero error?

Generalization Bounds · Learning Theory

Assessing Generalization of SGD via Disagreement

no code implementations · ICLR 2022 · Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, J. Zico Kolter

We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data.
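The estimator described above reduces to a simple disagreement count. A minimal sketch, assuming two classifiers of the same architecture trained with different SGD runs and an unlabeled data loader (all placeholder names):

    import torch

    @torch.no_grad()
    def disagreement_rate(model_a, model_b, unlabeled_loader):
        disagree, total = 0, 0
        for x in unlabeled_loader:
            pred_a = model_a(x).argmax(dim=-1)
            pred_b = model_b(x).argmax(dim=-1)
            disagree += (pred_a != pred_b).sum().item()
            total += x.size(0)
        return disagree / total   # used as an estimate of test error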

A Learning Theoretic Perspective on Local Explainability

no code implementations · ICLR 2021 · Jeffrey Li, Vaishnavh Nagarajan, Gregory Plumb, Ameet Talwalkar

In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations.

BIG-bench Machine Learning · Interpretable Machine Learning · +1

Understanding the Failure Modes of Out-of-Distribution Generalization

1 code implementation · ICLR 2021 · Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur

Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time.

Image Classification · Out-of-Distribution Generalization
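A toy construction of the setting described above, where a "background" feature agrees with the label most of the time during training but is uninformative at test time; the dimensions and correlation levels are illustrative assumptions, not the paper's benchmarks.

    import torch

    def make_split(n, spurious_corr):
        y = torch.randint(0, 2, (n,)).float() * 2 - 1          # labels in {-1, +1}
        core = y.unsqueeze(1) + 0.5 * torch.randn(n, 1)         # truly predictive feature
        flip = (torch.rand(n) < spurious_corr).float() * 2 - 1
        background = (y * flip).unsqueeze(1)                    # spuriously correlated feature
        return torch.cat([core, background], dim=1), y

    train_x, train_y = make_split(10_000, spurious_corr=0.95)   # background agrees with label 95% of the time
    test_x, test_y = make_split(10_000, spurious_corr=0.5)      # correlation vanishes at test time
    # A classifier fit on the training split tends to place weight on the
    # background feature and therefore loses accuracy on the test split.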

Provably Safe PAC-MDP Exploration Using Analogies

1 code implementation · 7 Jul 2020 · Melrose Roderick, Vaishnavh Nagarajan, J. Zico Kolter

A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration (needed to attain good performance on the task) with safety (needed to avoid catastrophic failure).

reinforcement-learning · Reinforcement Learning (RL) · +1

Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience

no code implementations · ICLR 2019 · Vaishnavh Nagarajan, J. Zico Kolter

The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters.

Generalization Bounds
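The noise resilience alluded to above can be probed directly: perturb the parameters with small Gaussian noise and measure how far the outputs move. A minimal sketch, where the noise scale and output norm are illustrative assumptions, not the paper's exact quantities.

    import copy
    import torch

    @torch.no_grad()
    def output_shift_under_noise(model, x, sigma=0.01):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))   # small random parameter perturbation
        return (model(x) - noisy(x)).norm(dim=-1).mean()   # small shift => "flat" region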

Uniform convergence may be unable to explain generalization in deep learning

1 code implementation · NeurIPS 2019 · Vaishnavh Nagarajan, J. Zico Kolter

Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence.

Generalization Bounds

Generalization in Deep Networks: The Role of Distance from Initialization

no code implementations · 7 Jan 2019 · Vaishnavh Nagarajan, J. Zico Kolter

Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network?

Revisiting Adversarial Risk

no code implementations · 7 Jun 2018 · Arun Sai Suggala, Adarsh Prasad, Vaishnavh Nagarajan, Pradeep Ravikumar

Based on the modified definition, we show that there is no trade-off between adversarial and standard accuracies; there exist classifiers that are robust and achieve high standard accuracy.

Image Classification

Lifelong Learning in Costly Feature Spaces

no code implementations · 30 Jun 2017 · Maria-Florina Balcan, Avrim Blum, Vaishnavh Nagarajan

An important long-term goal in machine learning systems is to build learning agents that, like humans, can learn many tasks over their lifetime, and moreover use information from these tasks to improve their ability to do so efficiently.

Gradient descent GAN optimization is locally stable

1 code implementation · NeurIPS 2017 · Vaishnavh Nagarajan, J. Zico Kolter

Despite the growing prominence of generative adversarial networks (GANs), optimization in GANs is still a poorly understood topic.

Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems

no code implementations · 14 Nov 2016 · Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, Colin White

We address this problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance.

Clustering · Learning Theory
