no code implementations • 4 Apr 2024 • Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
In this paper, we explore the idea of training large language models (LLMs) over highly compressed text.
no code implementations • 11 Dec 2023 • Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel
To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times.
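The loop described in steps (1)-(3) is easy to sketch. The snippet below is a schematic Python illustration only, not the authors' implementation: sample_solutions, passes_binary_check, and finetune are hypothetical placeholders standing in for an LLM sampler, a binary verifier (e.g. unit tests or an answer check), and a supervised fine-tuning step.

```python
def rest_em_style_loop(model, problems, num_iterations=3, samples_per_problem=16):
    """Schematic generate -> filter -> fine-tune loop (illustration only)."""
    for _ in range(num_iterations):
        dataset = []
        # (1) Generate samples from the current model and keep only those that
        #     receive positive binary feedback from a verifier.
        for problem in problems:
            for solution in sample_solutions(model, problem, n=samples_per_problem):
                if passes_binary_check(problem, solution):
                    dataset.append((problem, solution))
        # (2) Fine-tune the model on the filtered samples, then (3) repeat.
        model = finetune(model, dataset)
    return model
```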
no code implementations • 8 Nov 2023 • C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Sohl-Dickstein
We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
no code implementations • 25 Sep 2023 • Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
In this work, we seek ways to reproduce and study training stability and instability at smaller scales.
no code implementations • 10 Oct 2022 • Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability).
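The threshold at which that stabilization occurs is already visible in the quadratic case: gradient descent with step size eta is stable only while the largest Hessian eigenvalue stays below 2/eta. A minimal numpy sketch of this threshold (an illustration, not the paper's experiment):

```python
import numpy as np

# Gradient descent on a quadratic loss L(w) = 0.5 * w^T A w is stable only while
# the largest Hessian eigenvalue satisfies lambda_max < 2 / eta.

def final_loss(lam_max, eta=0.01, steps=200):
    A = np.diag([lam_max, 1.0])      # Hessian with largest eigenvalue lam_max
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (A @ w)        # gradient of 0.5 * w^T A w is A @ w
    return 0.5 * w @ A @ w

eta = 0.01
for lam in [0.9 * 2 / eta, 1.1 * 2 / eta]:
    print(f"lambda_max = {lam:.0f}, 2/eta = {2 / eta:.0f}, "
          f"final loss = {final_loss(lam, eta):.3e}")
# Just below 2/eta the iterates converge; just above it they diverge, which is why
# the sharpness is observed to hover near 2/eta at the edge of stability.
```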
no code implementations • 11 Jul 2022 • Lechao Xiao, Jeffrey Pennington
Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without requiring vast amounts of data.
no code implementations • 15 Jun 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
no code implementations • 15 Jun 2022 • Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein
We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.
no code implementations • 30 May 2022 • Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue M. Lu, Jeffrey Pennington
As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes.
no code implementations • 14 May 2022 • Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington
By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation.
no code implementations • NeurIPS 2021 • Nilesh Tripuraneni, Ben Adlam, Jeffrey Pennington
A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same.
Tasks: BIG-bench Machine Learning • Out-of-Distribution Generalization • +1
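Covariate shift in this sense is easy to simulate. The numpy sketch below (a synthetic illustration, not from the paper) shifts only the input distribution between train and test while keeping p(y|x) fixed, and shows how a model fit on the training distribution degrades on the shifted inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def labels(x):
    # The conditional distribution p(y|x) is identical at train and test time.
    return np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

# Covariate shift: only the input distribution p(x) changes.
x_train = rng.normal(loc=0.0, scale=1.0, size=500)   # training inputs
x_test  = rng.normal(loc=2.0, scale=1.0, size=500)   # shifted test inputs
y_train, y_test = labels(x_train), labels(x_test)

# Fit a simple polynomial regression on the training distribution.
coeffs = np.polyfit(x_train, y_train, deg=5)
mse_in  = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
mse_out = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
print(f"in-distribution MSE: {mse_in:.3f}")
print(f"shifted-input MSE:   {mse_out:.3f}")   # typically much larger
```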
no code implementations • ICLR 2022 • Gabriel Mel, Jeffrey Pennington
In contrast to standard statistical wisdom, modern learning algorithms typically find their best performance in the overparameterized regime in which the model has many more parameters than needed to fit the training data.
no code implementations • NeurIPS 2021 • Lechao Xiao, Jeffrey Pennington
By computing an eigen-decomposition of the infinite-width limits (a.k.a. Neural Kernels) of these architectures, we characterize how inductive biases (locality, weight-sharing, pooling, etc.) and the breaking of spurious symmetries can affect the performance of these learning systems.
no code implementations • NeurIPS 2020 • Ben Adlam, Jeffrey Pennington
Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function.
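That tension is straightforward to reproduce numerically. The following toy sketch (an illustration, not the paper's analysis) fits minimum-norm random-features regression with an increasing number of features; with label noise, the test error typically peaks near the interpolation threshold and then decreases again deep in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 100, 1000, 20, 0.5

w_true = rng.standard_normal(d) / np.sqrt(d)
def make_data(n):
    X = rng.standard_normal((n, d))
    return X, X @ w_true + noise * rng.standard_normal(n)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for p in [10, 50, 100, 200, 1000]:               # number of random ReLU features
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    # Minimum-norm least-squares fit (interpolates the training set once p >= n_train).
    beta = np.linalg.lstsq(F_tr, y_tr, rcond=None)[0]
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"features = {p:5d}, test MSE = {err:.3f}")
# The test error typically peaks near p ~ n_train and falls again for p >> n_train.
```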
1 code implementation • 14 Oct 2020 • Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek
This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue.
no code implementations • 14 Oct 2020 • Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz
In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$.
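As a small illustration of that dependence, the numpy sketch below (illustrative only, not the paper's analysis) evaluates softmax-cross-entropy and its gradient for temperature-scaled logits beta*z at several values of beta:

```python
import numpy as np

def softmax_xent_and_grad(logits, label):
    # Standard softmax-cross-entropy loss and its gradient w.r.t. the logits.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = p.copy()
    grad[label] -= 1.0
    return -np.log(p[label]), grad

rng = np.random.default_rng(0)
z = rng.standard_normal(10)          # raw logits at initialization
label = 3

for beta in [0.1, 1.0, 10.0]:
    loss, grad = softmax_xent_and_grad(beta * z, label)
    print(f"beta = {beta:5.1f}, ||beta*z|| = {np.linalg.norm(beta * z):6.2f}, "
          f"loss = {loss:5.2f}, ||grad|| = {np.linalg.norm(grad):.3f}")
# Small beta*z: the softmax is nearly uniform and the loss is close to log(10).
# Large beta*z: the loss and gradients are dominated by whichever logit happens
# to be largest at initialization.
```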
no code implementations • ICML 2020 • Ben Adlam, Jeffrey Pennington
Modern deep learning models employ considerably more parameters than required to fit the training data.
no code implementations • NeurIPS 2020 • Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein
We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods.
no code implementations • NeurIPS 2020 • Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington
Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes.
1 code implementation • 18 Jun 2020 • Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein
Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large.
no code implementations • ICLR 2020 • Wei Hu, Lechao Xiao, Jeffrey Pennington
The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance.
no code implementations • ICML 2020 • Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz
A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data.
no code implementations • 2 Dec 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington
In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity.
no code implementations • 25 Sep 2019 • Lechao Xiao, Jeffrey Pennington, Sam Schoenholz
In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably.
no code implementations • 25 Sep 2019 • Ben Adlam, Jake Levinson, Jeffrey Pennington
One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions.
no code implementations • 25 Sep 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington
We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.
no code implementations • ICLR 2019 • Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs).
no code implementations • ICLR 2019 • Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.
1 code implementation • NeurIPS 2019 • Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington
A longstanding goal in deep learning research has been to precisely characterize training and generalization.
no code implementations • NeurIPS 2018 • Jeffrey Pennington, Pratik Worah
An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent.
no code implementations • ICML 2018 • Minmin Chen, Jeffrey Pennington, Samuel S. Schoenholz
We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and random matrix theory.
3 code implementations • ICML 2018 • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington
In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme.
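The scheme used in this line of work is commonly referred to as a Delta-Orthogonal initialization: the convolution kernel is zero at every spatial offset except the centre, which carries an orthogonal channel-mixing matrix. A minimal numpy sketch of the idea (equal input and output channel counts assumed for simplicity):

```python
import numpy as np

def delta_orthogonal_kernel(ksize, channels, rng):
    """Conv kernel that is zero everywhere except the centre tap,
    which carries a random orthogonal channel-mixing matrix."""
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((channels, channels)))
    q *= np.sign(np.diag(r))          # fix column signs so the matrix is Haar-distributed
    kernel = np.zeros((ksize, ksize, channels, channels))
    kernel[ksize // 2, ksize // 2] = q
    return kernel

rng = np.random.default_rng(0)
k = delta_orthogonal_kernel(ksize=3, channels=64, rng=rng)
print(k.shape)                                        # (3, 3, 64, 64)
print(np.allclose(k[1, 1].T @ k[1, 1], np.eye(64)))   # centre tap is orthogonal -> True
```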
1 code implementation • 27 Feb 2018 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude.
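A quick way to see the role of this concentration (a toy sketch, not the paper's computation) is to compare the singular values of the end-to-end Jacobian of a deep linear network under orthogonal versus Gaussian initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 200

def jacobian_singular_values(init):
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        else:  # Gaussian weights with variance 1/width, the standard critical scaling
            W = rng.standard_normal((width, width)) / np.sqrt(width)
        J = W @ J   # for a linear network the input-output Jacobian is the product of the weights
    return np.linalg.svd(J, compute_uv=False)

for init in ["orthogonal", "gaussian"]:
    s = jacobian_singular_values(init)
    print(f"{init:10s}: min sv = {s.min():.2e}, max sv = {s.max():.2e}")
# Orthogonal: every singular value is exactly 1. Gaussian: the spectrum spreads over
# many orders of magnitude as depth grows, breaking dynamical isometry.
```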
no code implementations • ICLR 2018 • Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models.
no code implementations • 9 Feb 2018 • Ryan P. Adams, Jeffrey Pennington, Matthew J. Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, James Saunderson
However, naive eigenvalue estimation is computationally expensive even when the matrix can be represented; in many of these situations the matrix is so large as to only be available implicitly via products with vectors.
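In that implicit setting, Krylov methods can still recover extreme eigenvalues from matrix-vector products alone. A small sketch using SciPy's LinearOperator interface (an illustration, not the paper's estimator):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 5000
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 10.0, size=n)   # hidden spectrum; the matrix itself is never formed

def matvec(v):
    # Only the action of the (symmetric) matrix on a vector is available.
    return d * v

A = LinearOperator((n, n), matvec=matvec, dtype=np.float64)

# Lanczos iteration (eigsh) touches the matrix only through matvec calls.
top = eigsh(A, k=5, which="LA", return_eigenvectors=False)
print("estimated largest eigenvalues:", np.sort(top)[::-1])
print("true largest eigenvalue      :", d.max())
```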
no code implementations • NeurIPS 2017 • Jeffrey Pennington, Pratik Worah
Neural network configurations with random weights play an important role in the analysis of deep learning.
no code implementations • NeurIPS 2017 • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed.
7 code implementations • ICLR 2018 • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network.
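To make the GP connection concrete, here is a small numpy sketch (the standard arc-cosine/NNGP recursion for a ReLU network, written for this summary rather than taken from the paper) that builds the infinite-width kernel and uses it as a covariance function for GP posterior-mean regression:

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Infinite-width NNGP kernel of a deep ReLU MLP (arc-cosine recursion)."""
    d = X1.shape[1]
    K   = sigma_b2 + sigma_w2 * X1 @ X2.T / d          # cross-covariances, layer 0
    K11 = sigma_b2 + sigma_w2 * np.sum(X1**2, 1) / d   # variances for X1
    K22 = sigma_b2 + sigma_w2 * np.sum(X2**2, 1) / d   # variances for X2
    for _ in range(depth):
        norms = np.sqrt(np.outer(K11, K22))
        theta = np.arccos(np.clip(K / norms, -1.0, 1.0))
        K = sigma_b2 + sigma_w2 / (2 * np.pi) * norms * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K11 = sigma_b2 + sigma_w2 * K11 / 2.0   # E[relu(u)^2] = K11 / 2 for u ~ N(0, K11)
        K22 = sigma_b2 + sigma_w2 * K22 / 2.0
    return K

# Tiny 1-D regression problem: GP posterior mean with the NNGP kernel as covariance.
rng = np.random.default_rng(0)
X_tr = rng.uniform(-3, 3, size=(30, 1))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.standard_normal(30)
X_te = np.linspace(-3, 3, 100)[:, None]

K_tt = relu_nngp_kernel(X_tr, X_tr)
K_st = relu_nngp_kernel(X_te, X_tr)
noise = 0.1**2
mean = K_st @ np.linalg.solve(K_tt + noise * np.eye(len(X_tr)), y_tr)  # GP posterior mean
print(mean[:5])
```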
no code implementations • 18 Oct 2017 • Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics.
no code implementations • ICML 2017 • Jeffrey Pennington, Yasaman Bahri
We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions.
no code implementations • NeurIPS 2015 • Jeffrey Pennington, Felix Xinnan X. Yu, Sanjiv Kumar
Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials.
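To see why, note that the explicit feature map of an inhomogeneous polynomial kernel (x·y + c)^p over d-dimensional inputs spans all monomials of degree at most p, i.e. C(d+p, p) features, which grows rapidly with the degree. Illustrative arithmetic only (example values, not from the paper):

```python
from math import comb

d, p = 1000, 3            # input dimension and polynomial degree (example values)
print(comb(d + p, p))     # 167668501 monomial features for a degree-3 map on 1000-dim inputs
```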
4 code implementations • EMNLP 2014 • Jeffrey Pennington, Richard Socher, Christopher Manning
Ranked #14 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (using extra training data)