Search Results for author: Jeffrey Pennington

Found 46 papers, 7 papers with code

Training LLMs over Neurally Compressed Text

no code implementations • 4 Apr 2024 • Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

In this paper, we explore the idea of training large language models (LLMs) over highly compressed text.

Second-order regression models exhibit progressive sharpening to the edge of stability

no code implementations • 10 Oct 2022 • Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington

Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability).

regression
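
As a hedged illustration of how such sharpening can be monitored (not taken from the paper; the toy model, data, and hyperparameters below are made up for the sketch), one can track the largest Hessian eigenvalue with Hessian-vector products and power iteration during full-batch gradient descent, e.g. in JAX:

    import jax
    import jax.numpy as jnp

    # Toy loss that is quadratic in the parameters (illustrative stand-in only).
    def loss(w, X, y):
        preds = (X @ w) ** 2
        return jnp.mean((preds - y) ** 2)

    def top_hessian_eig(w, X, y, iters=50, seed=0):
        # Power iteration using only Hessian-vector products (forward-over-reverse).
        v = jax.random.normal(jax.random.PRNGKey(seed), w.shape)
        v = v / jnp.linalg.norm(v)
        hvp = lambda u: jax.jvp(jax.grad(lambda w_: loss(w_, X, y)), (w,), (u,))[1]
        for _ in range(iters):
            hv = hvp(v)
            v = hv / jnp.linalg.norm(hv)
        return v @ hvp(v)  # Rayleigh quotient ~ largest eigenvalue

    X = jax.random.normal(jax.random.PRNGKey(1), (256, 10))
    y = jax.random.normal(jax.random.PRNGKey(2), (256,))
    w = 0.1 * jax.random.normal(jax.random.PRNGKey(3), (10,))

    eta = 0.02  # step size; edge-of-stability behavior is typically reported near 2 / eta
    grad_fn = jax.grad(loss)
    for step in range(200):
        w = w - eta * grad_fn(w, X, y)
        if step % 20 == 0:
            print(step, float(top_hessian_eig(w, X, y)))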

Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm

no code implementations • 11 Jul 2022 • Lechao Xiao, Jeffrey Pennington

Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data.

Open-Ended Question Answering

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

no code implementations • 15 Jun 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.

Computational Efficiency

Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

no code implementations • 15 Jun 2022 • Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.

Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression

no code implementations • 30 May 2022 • Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue M. Lu, Jeffrey Pennington

As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes.

regression

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

no code implementations • 14 May 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation.

Vocal Bursts Intensity Prediction
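
For readers unfamiliar with the term, a Volterra integral equation of the second kind has the generic form (standard background; the paper's specific kernel and forcing function are not reproduced here):

    $\psi(t) = F(t) + \int_0^t K(t, s)\, \psi(s)\, \mathrm{d}s$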

Overparameterization Improves Robustness to Covariate Shift in High Dimensions

no code implementations • NeurIPS 2021 • Nilesh Tripuraneni, Ben Adlam, Jeffrey Pennington

A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same.

BIG-bench Machine Learning, Out-of-Distribution Generalization +1
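
Written out in symbols (standard definition, restating the covariate-shift description above): the marginal input distributions differ while the conditional label distribution is shared,

    $p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x), \qquad p_{\mathrm{train}}(y \mid x) = p_{\mathrm{test}}(y \mid x).$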

Covariate Shift in High-Dimensional Random Feature Regression

no code implementations • 16 Nov 2021 • Nilesh Tripuraneni, Ben Adlam, Jeffrey Pennington

A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same.

BIG-bench Machine Learning, Out-of-Distribution Generalization +2

Anisotropic Random Feature Regression in High Dimensions

no code implementations • ICLR 2022 • Gabriel Mel, Jeffrey Pennington

In contrast to standard statistical wisdom, modern learning algorithms typically find their best performance in the overparameterized regime in which the model has many more parameters than needed to fit the training data.

regression, Vocal Bursts Intensity Prediction

What Breaks The Curse of Dimensionality in Deep Learning?

no code implementations • NeurIPS 2021 • Lechao Xiao, Jeffrey Pennington

By computing an eigen-decomposition of the infinite-width limits (aka Neural Kernels) of these architectures, we characterize how inductive biases (locality, weight-sharing, pooling, etc.) and the breaking of spurious symmetries can affect the performance of these learning systems.

Open-Ended Question Answering

Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit

no code implementations • ICLR 2021 • Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue.

General Classification, Multi-class Classification +1

Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

no code implementations • NeurIPS 2020 • Ben Adlam, Jeffrey Pennington

Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function.

Ensemble Learning, Learning Theory
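
For context, the classical decomposition being refined is the textbook bias-variance split of the expected squared error (standard form, not the paper's fine-grained version):

    $\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(\mathbb{E}[\hat f(x)] - f^*(x)\big)^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2_{\mathrm{noise}}$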

Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

1 code implementation • 14 Oct 2020 • Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue.

General Classification, Multi-class Classification +1

Temperature check: theory and practice for training models with softmax-cross-entropy losses

no code implementations • 14 Oct 2020 • Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz

In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$.

Sentiment Analysis
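
A minimal sketch of the loss in question (illustrative only; the logits, labels, and β values below are made up): softmax-cross-entropy with an inverse temperature β applied to the logits z, alongside the initialization-scale quantity $||\beta{\bf z}||_{2}$ from the abstract:

    import jax
    import jax.numpy as jnp

    def softmax_xent(logits, labels, beta=1.0):
        # beta rescales the logits before the softmax; labels are integer class ids.
        log_probs = jax.nn.log_softmax(beta * logits, axis=-1)
        return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=-1))

    logits = jax.random.normal(jax.random.PRNGKey(0), (8, 10))  # stand-in for logits at init
    labels = jnp.zeros((8,), dtype=jnp.int32)
    for beta in (0.1, 1.0, 10.0):
        print(beta,
              float(softmax_xent(logits, labels, beta)),   # loss value
              float(jnp.linalg.norm(beta * logits)))       # ||beta z||_2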

Finite Versus Infinite Neural Networks: an Empirical Study

no code implementations • NeurIPS 2020 • Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods.

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

no code implementations • NeurIPS 2020 • Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes.

Exact posterior distributions of wide Bayesian neural networks

1 code implementation • 18 Jun 2020 • Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large.
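
A hedged empirical sketch of that statement (not the paper's analysis or code; widths, depth, activation, and sample counts are arbitrary): draw many wide, randomly initialized MLPs and summarize the distribution of the output at a fixed input across weight draws, which becomes approximately Gaussian as the width grows:

    import jax
    import jax.numpy as jnp

    def random_mlp_output(key, x, width=512, depth=3, sigma_w=1.0):
        h, fan_in = x, x.shape[-1]
        for _ in range(depth):
            key, sub = jax.random.split(key)
            W = sigma_w * jax.random.normal(sub, (fan_in, width)) / jnp.sqrt(fan_in)
            h, fan_in = jnp.tanh(h @ W), width
        key, sub = jax.random.split(key)
        return h @ (jax.random.normal(sub, (fan_in,)) / jnp.sqrt(fan_in))

    x = jnp.ones((5,))  # a single fixed input
    keys = jax.random.split(jax.random.PRNGKey(0), 500)
    samples = jnp.stack([random_mlp_output(k, x) for k in keys])
    print(float(samples.mean()), float(samples.std()))  # mean and std over the 500 weight draws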

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

no code implementations • ICLR 2020 • Wei Hu, Lechao Xiao, Jeffrey Pennington

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance.
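
A hedged sketch of the comparison the title points at (illustrative only; dimensions and depth are arbitrary): for a deep linear network, a product of orthogonally initialized layers has all singular values equal to one, whereas a product of Gaussian-initialized layers does not:

    import jax
    import jax.numpy as jnp

    def orthogonal(key, n):
        # QR of a Gaussian matrix yields a random orthogonal matrix.
        Q, R = jnp.linalg.qr(jax.random.normal(key, (n, n)))
        return Q * jnp.sign(jnp.diag(R))

    gaussian = lambda key, n: jax.random.normal(key, (n, n)) / jnp.sqrt(n)

    def end_to_end(keys, n, init):
        M = jnp.eye(n)
        for k in keys:
            M = init(k, n) @ M
        return M

    n, depth = 64, 50
    keys = jax.random.split(jax.random.PRNGKey(0), depth)
    for name, init in (("orthogonal", orthogonal), ("gaussian", gaussian)):
        s = jnp.linalg.svd(end_to_end(keys, n, init), compute_uv=False)
        print(name, float(s.max()), float(s.min()))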

Disentangling Trainability and Generalization in Deep Neural Networks

no code implementations • ICML 2020 • Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data.

Gaussian Processes

A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

no code implementations • 2 Dec 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington

In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity.

Disentangling Trainability and Generalization in Deep Learning

no code implementations • 25 Sep 2019 • Lechao Xiao, Jeffrey Pennington, Sam Schoenholz

In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably.

Gaussian Processes

A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions

no code implementations • 25 Sep 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington

One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions.

Vocal Bursts Intensity Prediction

The Dynamics of Signal Propagation in Gated Recurrent Neural Networks

no code implementations • 25 Sep 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.

A Mean Field Theory of Batch Normalization

no code implementations • ICLR 2019 • Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.

Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

no code implementations • 25 Jan 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

no code implementations • NeurIPS 2018 • Jeffrey Pennington, Pratik Worah

An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent.

Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks

no code implementations • ICML 2018 • Minmin Chen, Jeffrey Pennington, Samuel S. Schoenholz

We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and random matrix theory.

Language Modelling

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

3 code implementations • ICML 2018 • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme.
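
A hedged sketch in the spirit of the initialization the paper refers to (a "delta-orthogonal" convolution kernel; that name, the equal input/output channel assumption, and the sizes below are taken as illustrative here): the central spatial tap is a random orthogonal matrix and every other tap is zero, so each layer acts orthogonally at initialization:

    import jax
    import jax.numpy as jnp

    def delta_orthogonal(key, ksize, channels):
        A = jax.random.normal(key, (channels, channels))
        Q, R = jnp.linalg.qr(A)
        Q = Q * jnp.sign(jnp.diag(R))                        # random orthogonal matrix
        W = jnp.zeros((ksize, ksize, channels, channels))    # HWIO layout
        center = ksize // 2
        return W.at[center, center].set(Q)                   # only the central tap is nonzero

    W = delta_orthogonal(jax.random.PRNGKey(0), ksize=3, channels=128)
    print(W.shape)  # (3, 3, 128, 128)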

The Emergence of Spectral Universality in Deep Networks

1 code implementation • 27 Feb 2018 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude.
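
A hedged sketch of the quantity being discussed (illustrative architecture, weight and bias variances; not the paper's setup): compute the input-output Jacobian of a randomly initialized tanh MLP and inspect how its singular values spread around one:

    import jax
    import jax.numpy as jnp

    def mlp(params, x):
        h = x
        for W, b in params:
            h = jnp.tanh(h @ W + b)
        return h

    def init(key, width, depth, sigma_w=1.5, sigma_b=0.05):
        params = []
        for _ in range(depth):
            key, kw, kb = jax.random.split(key, 3)
            params.append((sigma_w * jax.random.normal(kw, (width, width)) / jnp.sqrt(width),
                           sigma_b * jax.random.normal(kb, (width,))))
        return params

    width, depth = 256, 20
    params = init(jax.random.PRNGKey(0), width, depth)
    x = jax.random.normal(jax.random.PRNGKey(1), (width,))
    J = jax.jacfwd(lambda x_: mlp(params, x_))(x)        # input-output Jacobian
    s = jnp.linalg.svd(J, compute_uv=False)
    print(float(s.min()), float(jnp.median(s)), float(s.max()))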

Sensitivity and Generalization in Neural Networks: an Empirical Study

no code implementations • ICLR 2018 • Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models.

Data Augmentation, Image Classification

Estimating the Spectral Density of Large Implicit Matrices

no code implementations • 9 Feb 2018 • Ryan P. Adams, Jeffrey Pennington, Matthew J. Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, James Saunderson

However, naive eigenvalue estimation is computationally expensive even when the matrix can be represented; in many of these situations the matrix is so large as to only be available implicitly via products with vectors.
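
A minimal sketch of the matrix-free idea (not the paper's estimator; the probe count, moment count, and test matrix are arbitrary): estimate the spectral moments tr(A^k)/n of a symmetric matrix from matrix-vector products alone, using random ±1 probes:

    import jax
    import jax.numpy as jnp

    def moment_estimates(matvec, dim, num_moments=4, num_probes=16, seed=0):
        keys = jax.random.split(jax.random.PRNGKey(seed), num_probes)
        moments = jnp.zeros(num_moments)
        for key in keys:
            z = jnp.sign(jax.random.normal(key, (dim,)))  # random +/-1 probe vector
            v, probe = z, []
            for _ in range(num_moments):
                v = matvec(v)                  # after k+1 applications, v = A^{k+1} z
                probe.append(z @ v / dim)      # unbiased estimate of tr(A^{k+1}) / dim
            moments = moments + jnp.array(probe)
        return moments / num_probes

    # Example: the "implicit" matrix here is just a dense symmetric matrix hidden behind a matvec.
    A = jax.random.normal(jax.random.PRNGKey(1), (500, 500)) / jnp.sqrt(500)
    A = (A + A.T) / 2
    print(moment_estimates(lambda v: A @ v, dim=500))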

Nonlinear random matrix theory for deep learning

no code implementations • NeurIPS 2017 • Jeffrey Pennington, Pratik Worah

Neural network configurations with random weights play an important role in the analysis of deep learning.

Memorization

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

no code implementations • NeurIPS 2017 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed.

Deep Neural Networks as Gaussian Processes

7 code implementations • ICLR 2018 • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network.

Bayesian Inference, Gaussian Processes
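
A hedged sketch of using such a kernel as a GP covariance function (not the paper's code; the kernel depth, weight/bias variances, data, and noise level below are illustrative): compute a finite-depth ReLU NNGP kernel with the standard arc-cosine recursion and form the usual GP posterior mean:

    import jax
    import jax.numpy as jnp

    def nngp_relu(X1, X2, depth=3, sw2=2.0, sb2=0.0):
        d = X1.shape[-1]
        K12 = sw2 * (X1 @ X2.T) / d + sb2
        K11 = sw2 * jnp.sum(X1 * X1, axis=1) / d + sb2    # diagonal entries for X1
        K22 = sw2 * jnp.sum(X2 * X2, axis=1) / d + sb2    # diagonal entries for X2
        for _ in range(depth):
            norms = jnp.sqrt(K11[:, None] * K22[None, :])
            cos = jnp.clip(K12 / norms, -1.0, 1.0)
            theta = jnp.arccos(cos)
            # E[relu(u) relu(v)] for centered Gaussians with the current covariance
            K12 = sb2 + (sw2 / (2 * jnp.pi)) * norms * (jnp.sin(theta) + (jnp.pi - theta) * cos)
            K11 = sb2 + (sw2 / 2) * K11                   # E[relu(u)^2] = K(x, x) / 2
            K22 = sb2 + (sw2 / 2) * K22
        return K12

    def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
        K = nngp_relu(X_train, X_train)
        K_star = nngp_relu(X_test, X_train)
        alpha = jnp.linalg.solve(K + noise * jnp.eye(X_train.shape[0]), y_train)
        return K_star @ alpha

    X_train = jax.random.normal(jax.random.PRNGKey(0), (50, 8))
    y_train = jnp.sin(X_train[:, 0])
    X_test = jax.random.normal(jax.random.PRNGKey(1), (5, 8))
    print(gp_posterior_mean(X_train, y_train, X_test))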

A Correspondence Between Random Neural Networks and Statistical Field Theory

no code implementations • 18 Oct 2017 • Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics.

Geometry of Neural Network Loss Surfaces via Random Matrix Theory

no code implementations • ICML 2017 • Jeffrey Pennington, Yasaman Bahri

We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions.

Memorization

Spherical Random Features for Polynomial Kernels

no code implementations • NeurIPS 2015 • Jeffrey Pennington, Felix Xinnan X. Yu, Sanjiv Kumar

Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials.

General Classification
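
For background on the kernels discussed above (standard facts, not the paper's construction): the degree-p polynomial kernel and the dimensionality of its exact feature map are

    $k(x, x') = (\langle x, x' \rangle + c)^p, \qquad \dim \phi(x) = \binom{d + p}{p},$

which grows quickly with the input dimension d and the degree p, motivating randomized low-dimensional approximations.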
