no code implementations • 4 Apr 2024 • Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
In this paper, we explore the idea of training large language models (LLMs) over highly compressed text.
no code implementations • 11 Dec 2023 • Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel
To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times.
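The loop described in steps (1)-(3) is easy to sketch. The snippet below is a schematic Python illustration only, not the authors' implementation: sample_solutions, passes_binary_check, and finetune are hypothetical placeholders standing in for an LLM sampler, a binary verifier (e.g. unit tests or an answer check), and a supervised fine-tuning step.

```python
def rest_em_style_loop(model, problems, num_iterations=3, samples_per_problem=16):
    """Schematic generate -> filter -> fine-tune loop (illustration only)."""
    for _ in range(num_iterations):
        dataset = []
        # (1) Generate samples from the current model and keep only those that
        #     receive positive binary feedback from a verifier.
        for problem in problems:
            for solution in sample_solutions(model, problem, n=samples_per_problem):
                if passes_binary_check(problem, solution):
                    dataset.append((problem, solution))
        # (2) Fine-tune the model on the filtered samples, then (3) repeat.
        model = finetune(model, dataset)
    return model
```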
no code implementations • 8 Nov 2023 • C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Sohl-Dickstein
We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
no code implementations • 25 Sep 2023 • Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
In this work, we seek ways to reproduce and study training stability and instability at smaller scales.
no code implementations • 10 Oct 2022 • Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability).
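The threshold at which that stabilization occurs is already visible in the quadratic case: gradient descent with step size eta is stable only while the largest Hessian eigenvalue stays below 2/eta. A minimal numpy sketch of this threshold (an illustration, not the paper's experiment):

```python
import numpy as np

# Gradient descent on a quadratic loss L(w) = 0.5 * w^T A w is stable only while
# the largest Hessian eigenvalue satisfies lambda_max < 2 / eta.

def final_loss(lam_max, eta=0.01, steps=200):
    A = np.diag([lam_max, 1.0])      # Hessian with largest eigenvalue lam_max
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (A @ w)        # gradient of 0.5 * w^T A w is A @ w
    return 0.5 * w @ A @ w

eta = 0.01
for lam in [0.9 * 2 / eta, 1.1 * 2 / eta]:
    print(f"lambda_max = {lam:.0f}, 2/eta = {2 / eta:.0f}, "
          f"final loss = {final_loss(lam, eta):.3e}")
# Just below 2/eta the iterates converge; just above it they diverge, which is why
# the sharpness is observed to hover near 2/eta at the edge of stability.
```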
no code implementations • 11 Jul 2022 • Lechao Xiao, Jeffrey Pennington
Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without requiring vast amounts of data.
no code implementations • 15 Jun 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
no code implementations • 15 Jun 2022 • Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein
We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.
no code implementations • 30 May 2022 • Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue M. Lu, Jeffrey Pennington
As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes.
no code implementations • 14 May 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation.
no code implementations • NeurIPS 2021 • Nilesh Tripuraneni, Ben Adlam, Jeffrey Pennington
A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same.
Tasks: BIG-bench Machine Learning • Out-of-Distribution Generalization • +1
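Covariate shift in this sense is easy to simulate. The numpy sketch below (a synthetic illustration, not from the paper) shifts only the input distribution between train and test while keeping p(y|x) fixed, and shows how a model fit on the training distribution degrades on the shifted inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def labels(x):
    # The conditional distribution p(y|x) is identical at train and test time.
    return np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

# Covariate shift: only the input distribution p(x) changes.
x_train = rng.normal(loc=0.0, scale=1.0, size=500)   # training inputs
x_test  = rng.normal(loc=2.0, scale=1.0, size=500)   # shifted test inputs
y_train, y_test = labels(x_train), labels(x_test)

# Fit a simple polynomial regression on the training distribution.
coeffs = np.polyfit(x_train, y_train, deg=5)
mse_in  = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
mse_out = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
print(f"in-distribution MSE: {mse_in:.3f}")
print(f"shifted-input MSE:   {mse_out:.3f}")   # typically much larger
```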
no code implementations • ICLR 2022 • Gabriel Mel, Jeffrey Pennington
In contrast to standard statistical wisdom, modern learning algorithms typically find their best performance in the overparameterized regime in which the model has many more parameters than needed to fit the training data.
no code implementations • NeurIPS 2021 • Lechao Xiao, Jeffrey Pennington
By computing an eigen-decomposition of the infinite-width limits (a.k.a. Neural Kernels) of these architectures, we characterize how inductive biases (locality, weight-sharing, pooling, etc.) and the breaking of spurious symmetries can affect the performance of these learning systems.
no code implementations • NeurIPS 2020 • Ben Adlam, Jeffrey Pennington
Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function.
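That tension is straightforward to reproduce numerically. The following toy sketch (an illustration, not the paper's analysis) fits minimum-norm random-features regression with an increasing number of features; with label noise, the test error typically peaks near the interpolation threshold and then decreases again deep in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 100, 1000, 20, 0.5

w_true = rng.standard_normal(d) / np.sqrt(d)
def make_data(n):
    X = rng.standard_normal((n, d))
    return X, X @ w_true + noise * rng.standard_normal(n)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for p in [10, 50, 100, 200, 1000]:               # number of random ReLU features
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    # Minimum-norm least-squares fit (interpolates the training set once p >= n_train).
    beta = np.linalg.lstsq(F_tr, y_tr, rcond=None)[0]
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"features = {p:5d}, test MSE = {err:.3f}")
# The test error typically peaks near p ~ n_train and falls again for p >> n_train.
```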
1 code implementation • 14 Oct 2020 • Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek
This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue.
no code implementations • 14 Oct 2020 • Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz
In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$.
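As a small illustration of that dependence, the numpy sketch below (illustrative only, not the paper's analysis) evaluates softmax-cross-entropy and its gradient for temperature-scaled logits beta*z at several values of beta:

```python
import numpy as np

def softmax_xent_and_grad(logits, label):
    # Standard softmax-cross-entropy loss and its gradient w.r.t. the logits.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = p.copy()
    grad[label] -= 1.0
    return -np.log(p[label]), grad

rng = np.random.default_rng(0)
z = rng.standard_normal(10)          # raw logits at initialization
label = 3

for beta in [0.1, 1.0, 10.0]:
    loss, grad = softmax_xent_and_grad(beta * z, label)
    print(f"beta = {beta:5.1f}, ||beta*z|| = {np.linalg.norm(beta * z):6.2f}, "
          f"loss = {loss:5.2f}, ||grad|| = {np.linalg.norm(grad):.3f}")
# Small beta*z: the softmax is nearly uniform and the loss is close to log(10).
# Large beta*z: the loss and gradients are dominated by whichever logit happens
# to be largest at initialization.
```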
no code implementations • ICML 2020 • Ben Adlam, Jeffrey Pennington
Modern deep learning models employ considerably more parameters than required to fit the training data.
no code implementations • NeurIPS 2020 • Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein
We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods.
no code implementations • NeurIPS 2020 • Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington
Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes.
1 code implementation • 18 Jun 2020 • Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein
Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large.
no code implementations • ICLR 2020 • Wei Hu, Lechao Xiao, Jeffrey Pennington
The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance.
no code implementations • ICML 2020 • Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz
A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data.
no code implementations • 2 Dec 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington
In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity.
no code implementations • 25 Sep 2019 • Lechao Xiao, Jeffrey Pennington, Sam Schoenholz
In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably.
no code implementations • 25 Sep 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington
One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions.
no code implementations • 25 Sep 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington
We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.
no code implementations • ICLR 2019 • Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs).
no code implementations • ICLR 2019 • Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.
1 code implementation • NeurIPS 2019 • Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington
A longstanding goal in deep learning research has been to precisely characterize training and generalization.
no code implementations • NeurIPS 2018 • Jeffrey Pennington, Pratik Worah
An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent.
no code implementations • ICML 2018 • Minmin Chen, Jeffrey Pennington, Samuel S. Schoenholz
We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and random matrix theory.
3 code implementations • ICML 2018 • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington
In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme.
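The scheme used in this line of work is commonly referred to as a Delta-Orthogonal initialization: the convolution kernel is zero at every spatial offset except the centre, which carries an orthogonal channel-mixing matrix. A minimal numpy sketch of the idea (equal input and output channel counts assumed for simplicity):

```python
import numpy as np

def delta_orthogonal_kernel(ksize, channels, rng):
    """Conv kernel that is zero everywhere except the centre tap,
    which carries a random orthogonal channel-mixing matrix."""
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((channels, channels)))
    q *= np.sign(np.diag(r))          # fix column signs so the matrix is Haar-distributed
    kernel = np.zeros((ksize, ksize, channels, channels))
    kernel[ksize // 2, ksize // 2] = q
    return kernel

rng = np.random.default_rng(0)
k = delta_orthogonal_kernel(ksize=3, channels=64, rng=rng)
print(k.shape)                                        # (3, 3, 64, 64)
print(np.allclose(k[1, 1].T @ k[1, 1], np.eye(64)))   # centre tap is orthogonal -> True
```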
1 code implementation • 27 Feb 2018 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude.
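A quick way to see the role of this concentration (a toy sketch, not the paper's computation) is to compare the singular values of the end-to-end Jacobian of a deep linear network under orthogonal versus Gaussian initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 200

def jacobian_singular_values(init):
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        else:  # Gaussian weights with variance 1/width, the standard critical scaling
            W = rng.standard_normal((width, width)) / np.sqrt(width)
        J = W @ J   # for a linear network the input-output Jacobian is the product of the weights
    return np.linalg.svd(J, compute_uv=False)

for init in ["orthogonal", "gaussian"]:
    s = jacobian_singular_values(init)
    print(f"{init:10s}: min sv = {s.min():.2e}, max sv = {s.max():.2e}")
# Orthogonal: every singular value is exactly 1. Gaussian: the spectrum spreads over
# many orders of magnitude as depth grows, breaking dynamical isometry.
```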
no code implementations • ICLR 2018 • Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models.
no code implementations • 9 Feb 2018 • Ryan P. Adams, Jeffrey Pennington, Matthew J. Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, James Saunderson
However, naive eigenvalue estimation is computationally expensive even when the matrix can be represented; in many of these situations the matrix is so large as to only be available implicitly via products with vectors.
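In that implicit setting, Krylov methods can still recover extreme eigenvalues from matrix-vector products alone. A small sketch using SciPy's LinearOperator interface (an illustration, not the paper's estimator):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 5000
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 10.0, size=n)   # hidden spectrum; the matrix itself is never formed

def matvec(v):
    # Only the action of the (symmetric) matrix on a vector is available.
    return d * v

A = LinearOperator((n, n), matvec=matvec, dtype=np.float64)

# Lanczos iteration (eigsh) touches the matrix only through matvec calls.
top = eigsh(A, k=5, which="LA", return_eigenvectors=False)
print("estimated largest eigenvalues:", np.sort(top)[::-1])
print("true largest eigenvalue      :", d.max())
```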
no code implementations • NeurIPS 2017 • Jeffrey Pennington, Pratik Worah
Neural network configurations with random weights play an important role in the analysis of deep learning.
no code implementations • NeurIPS 2017 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed.
7 code implementations • ICLR 2018 • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network.
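To make the GP connection concrete, here is a small numpy sketch (the standard arc-cosine/NNGP recursion for a ReLU network, written for this summary rather than taken from the paper) that builds the infinite-width kernel and uses it as a covariance function for GP posterior-mean regression:

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Infinite-width NNGP kernel of a deep ReLU MLP (arc-cosine recursion)."""
    d = X1.shape[1]
    K   = sigma_b2 + sigma_w2 * X1 @ X2.T / d          # cross-covariances, layer 0
    K11 = sigma_b2 + sigma_w2 * np.sum(X1**2, 1) / d   # variances for X1
    K22 = sigma_b2 + sigma_w2 * np.sum(X2**2, 1) / d   # variances for X2
    for _ in range(depth):
        norms = np.sqrt(np.outer(K11, K22))
        theta = np.arccos(np.clip(K / norms, -1.0, 1.0))
        K = sigma_b2 + sigma_w2 / (2 * np.pi) * norms * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K11 = sigma_b2 + sigma_w2 * K11 / 2.0   # E[relu(u)^2] = K11 / 2 for u ~ N(0, K11)
        K22 = sigma_b2 + sigma_w2 * K22 / 2.0
    return K

# Tiny 1-D regression problem: GP posterior mean with the NNGP kernel as covariance.
rng = np.random.default_rng(0)
X_tr = rng.uniform(-3, 3, size=(30, 1))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.standard_normal(30)
X_te = np.linspace(-3, 3, 100)[:, None]

K_tt = relu_nngp_kernel(X_tr, X_tr)
K_st = relu_nngp_kernel(X_te, X_tr)
noise = 0.1**2
mean = K_st @ np.linalg.solve(K_tt + noise * np.eye(len(X_tr)), y_tr)  # GP posterior mean
print(mean[:5])
```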
no code implementations • 18 Oct 2017 • Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics.
no code implementations • ICML 2017 • Jeffrey Pennington, Yasaman Bahri
We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions.
no code implementations • NeurIPS 2015 • Jeffrey Pennington, Felix Xinnan X. Yu, Sanjiv Kumar
Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials.
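To see why, note that the explicit feature map of an inhomogeneous polynomial kernel (x·y + c)^p over d-dimensional inputs spans all monomials of degree at most p, i.e. C(d+p, p) features, which grows rapidly with the degree. Illustrative arithmetic only (example values, not from the paper):

```python
from math import comb

d, p = 1000, 3            # input dimension and polynomial degree (example values)
print(comb(d + p, p))     # 167668501 monomial features for a degree-3 map on 1000-dim inputs
```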
4 code implementations • EMNLP 2014 • Jeffrey Pennington, Richard Socher, Christopher Manning
Ranked #14 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (using extra training data)