no code implementations • 26 Oct 2023 • Greg Yang, James B. Simon, Jeremy Bernstein
The push to train ever larger neural networks has motivated the study of initialization and training at large network width.
no code implementations • 3 Oct 2023 • Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones.
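As a toy illustration of widthwise transfer (the function and variable names below are illustrative, not the paper's): under $\mu$P with an Adam-style optimizer, the learning rate of hidden weight matrices is scaled like $1/\text{width}$, so a rate tuned at a small base width can be reused at a larger one.

```python
# Hypothetical sketch of one muP-style scaling rule (hidden weight matrices
# under an Adam-like optimizer); names here are illustrative.
def mup_hidden_lr(base_lr, base_width, width):
    """Rescale a learning rate tuned at base_width to a larger width,
    using the 1/width rule for hidden matrices."""
    return base_lr * base_width / width

# A rate tuned on a narrow model (width 128) transfers to a wide one (width 1024):
lr_wide = mup_hidden_lr(0.1, 128, 1024)   # -> 0.0125
```

Other parameter groups (e.g., input and output layers) follow different scaling rules under $\mu$P; this sketch shows only the hidden-matrix rule.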
no code implementations • 3 Aug 2023 • Greg Yang, Etai Littwin
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam?
no code implementations • 1 Feb 2023 • Soufiane Hayou, Greg Yang
We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), results in the same covariance structure no matter how that limit is taken.
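A quick numerical sketch of why the $1/\sqrt{depth}$ branch scaling is natural (this checks only that activations stay at a stable scale as depth grows; it is not the paper's argument):

```python
import numpy as np

# Residual forward pass with branches scaled by 1/sqrt(depth):
# each branch contributes O(1/depth) variance, so depth layers add O(1) total.
rng = np.random.default_rng(0)

def forward(x, depth, width):
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + np.tanh(W @ h) / np.sqrt(depth)   # skip connection + scaled branch
    return h

width = 256
x = rng.standard_normal(width)
shallow = forward(x, depth=4, width=width)
deep = forward(x, depth=64, width=width)
# The two output norms stay comparable despite 16x more layers.
ratio = np.linalg.norm(deep) / np.linalg.norm(shallow)
```

Without the $1/\sqrt{depth}$ factor, the same loop would blow up the activation norm as depth increases.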
no code implementations • 3 May 2022 • Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$.
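A minimal NumPy sketch of this setup (dimensions, activation, and learning rate are illustrative choices, not the paper's):

```python
import numpy as np

# f(x) = (1/sqrt(N)) a^T sigma(W^T x), trained on the empirical MSE loss.
rng = np.random.default_rng(0)
d, N, n = 5, 16, 8                       # input dim, width, sample count
W = rng.standard_normal((d, N))          # first-layer weights (random init)
a = rng.standard_normal(N)               # second-layer weights (random init)
X = rng.standard_normal((n, d))          # training inputs
y = rng.standard_normal(n)               # training targets

def f(X, W):
    return (np.tanh(X @ W) @ a) / np.sqrt(N)

# One gradient-descent step on W for L = (1/n) sum_i (f(x_i) - y_i)^2.
pre = X @ W                              # (n, N) preactivations
err = f(X, W) - y                        # (n,) residuals
# dL/dW = (2/n) sum_i err_i * x_i outer (sigma'(pre_i) * a) / sqrt(N)
grad_W = (2 / n) * X.T @ (err[:, None] * (1 - np.tanh(pre) ** 2) * (a / np.sqrt(N)))
lr = 0.1
W_new = W - lr * grad_W                  # the first-step update studied above
```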
3 code implementations • 7 Mar 2022 • Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters.
1 code implementation • 4 Nov 2021 • Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao
We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
no code implementations • ICLR 2022 • Greg Yang, Michael Santacroce, Edward J. Hu
While a popular limit of infinite-width neural networks, the Neural Tangent Kernel (NTK) often exhibits performance gaps from finite-width neural networks on standard datasets, due to lack of feature learning.
no code implementations • 1 Jul 2021 • Etai Littwin, Omid Saremi, Shuangfei Zhai, Vimal Thilak, Hanlin Goh, Joshua M. Susskind, Greg Yang
We analyze the learning dynamics of infinitely wide neural networks with a finite-sized bottleneck.
1 code implementation • 7 Jun 2021 • Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang, Ashish Kapoor, Aleksander Madry
We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation.
no code implementations • 8 May 2021 • Greg Yang, Etai Littwin
To facilitate this proof, we develop a graphical notation for Tensor Programs.
4 code implementations • 30 Nov 2020 • Greg Yang, Edward J. Hu
However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT.
no code implementations • 22 Sep 2020 • Greg Yang
FIP and these results hold for any neural architecture.
2 code implementations • 25 Jun 2020 • Greg Yang
We prove that a randomly initialized neural network of *any architecture* has its Tangent Kernel (NTK) converge to a deterministic limit, as the network widths tend to infinity.
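The *empirical* tangent kernel that this limit describes can be computed directly: $\mathrm{NTK}(x, x') = \langle \partial f(x)/\partial\theta, \partial f(x')/\partial\theta \rangle$. A small numerical sketch for a one-hidden-layer network (Jacobians via finite differences; all sizes are illustrative):

```python
import numpy as np

# Empirical NTK of a tiny MLP: inner products of parameter-gradients of the output.
rng = np.random.default_rng(0)
d, N = 3, 32
params = rng.standard_normal(d * N + N)  # flattened [W (d x N), a (N)]

def f(x, p):
    W = p[: d * N].reshape(d, N)
    a = p[d * N:]
    return np.tanh(x @ W) @ a / np.sqrt(N)

def jacobian(x, p, eps=1e-5):
    """df(x)/dp via central finite differences."""
    g = np.empty_like(p)
    for i in range(p.size):
        pp, pm = p.copy(), p.copy()
        pp[i] += eps
        pm[i] -= eps
        g[i] = (f(x, pp) - f(x, pm)) / (2 * eps)
    return g

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
J1, J2 = jacobian(x1, params), jacobian(x2, params)
ntk = np.array([[J1 @ J1, J1 @ J2],
                [J2 @ J1, J2 @ J2]])     # 2x2 empirical kernel matrix
```

The theorem above says that as $N \to \infty$ this random matrix converges to a deterministic limit, for any architecture expressible as a Tensor Program.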
1 code implementation • 26 Apr 2020 • Edward J. Hu, Adith Swaminathan, Hadi Salman, Greg Yang
Robustness against image perturbations bounded by an $\ell_p$ ball has been well studied in the recent literature.
1 code implementation • NeurIPS 2020 • Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang
{\em Hypernetworks} are architectures that produce the weights of a task-specific {\em primary network}.
4 code implementations • NeurIPS 2020 • Hadi Salman, Ming-Jie Sun, Greg Yang, Ashish Kapoor, J. Zico Kolter
We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks.
1 code implementation • ICML 2020 • Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, Jerry Li
Randomized smoothing is the current state-of-the-art defense with provable robustness against $\ell_2$ adversarial attacks.
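The prediction rule of randomized smoothing is simple to sketch: classify by majority vote over Gaussian perturbations of the input. The base classifier below is a hypothetical stand-in, and the constants are illustrative:

```python
import numpy as np

# Schematic smoothed classifier: g(x) = argmax_c P[base(x + noise) = c],
# estimated by Monte Carlo voting under isotropic Gaussian noise.
rng = np.random.default_rng(0)

def base_classifier(x):
    # Hypothetical base classifier: sign of the first coordinate, as class {0, 1}.
    return int(x[0] > 0)

def smoothed_classifier(x, sigma=0.5, n_samples=1000):
    noise = rng.standard_normal((n_samples, x.size)) * sigma
    votes = np.array([base_classifier(x + eps) for eps in noise])
    counts = np.bincount(votes, minlength=2)
    return int(np.argmax(counts))

x = np.array([1.0, 0.0])
pred = smoothed_classifier(x)   # majority class under Gaussian noise
```

The vote probabilities also determine a certified $\ell_2$ radius around $x$; this sketch shows only the prediction rule, not the certification step.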
1 code implementation • NeurIPS 2019 • Greg Yang
Wide neural networks with random weights and biases are Gaussian processes, as observed by Neal (1995) for shallow networks, and more recently by Lee et al.~(2018) and Matthews et al.~(2018) for deep fully-connected networks, as well as by Novak et al.~(2019) and Garriga-Alonso et al.~(2019) for deep convolutional networks.
2 code implementations • 28 Oct 2019 • Greg Yang
Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks.
no code implementations • 25 Sep 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington
We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.
no code implementations • 5 Sep 2019 • Justin Chen, Christopher Eur, Greg Yang, Mengyuan Zhang
Function classes are collections of Boolean functions on a finite set, which are fundamental objects of study in theoretical computer science.
1 code implementation • 24 Jul 2019 • Greg Yang, Hadi Salman
Are neural networks biased toward simple functions?
3 code implementations • NeurIPS 2019 • Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, Sebastien Bubeck
In this paper, we employ adversarial training to improve the performance of randomized smoothing.
no code implementations • ICLR 2019 • Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs).
3 code implementations • NeurIPS 2019 • Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, Pengchuan Zhang
This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification.
no code implementations • ICLR 2019 • Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.
no code implementations • 13 Feb 2019 • Greg Yang
Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks.
1 code implementation • 12 Feb 2019 • Matthew Hausknecht, Ricky Loynd, Greg Yang, Adith Swaminathan, Jason D. Williams
Interactive Fiction (IF) games are complex textual decision making problems.
no code implementations • 25 Jan 2019 • Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington
We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.
no code implementations • 11 Oct 2018 • Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs).
no code implementations • ICLR 2018 • Greg Yang, Sam S. Schoenholz
Using the obtained mean field theory, we are able to track surprisingly well how VV at initialization time affects training and test performance on MNIST after a set number of epochs: the level sets of test/train accuracy coincide with the level sets of the expectations of certain gradient norms or of metric expressivity (as defined in \cite{yang_meanfield_2017}), a measure of expansion in a random neural network.
no code implementations • NeurIPS 2017 • Greg Yang, Samuel S. Schoenholz
Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward.
no code implementations • 9 Jan 2017 • Greg Yang
In computational complexity, a complexity class is given by a set of problems or functions, and a basic challenge is to show separations of complexity classes, $A \neq B$, especially when $A$ is known to be a subset of $B$.
no code implementations • 9 Nov 2016 • Greg Yang, Alexander M. Rush
The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and memory access is performed by linear smoothing in key space.
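A toy sketch of this access pattern (all names and the smoothing kernel below are illustrative): the head is a point in key space, a shift acts on it as a group action, and a read is a distance-weighted average over stored (key, value) pairs.

```python
import numpy as np

# Lie-access-style memory read: move the head by a shift action on R^2,
# then read by smoothing values with weights that decay in key-space distance.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # stored keys
values = np.array([10.0, 20.0, 30.0])                    # stored values
head = np.array([0.0, 0.0])                              # head position in key space

def shift(head, delta):
    return head + delta                      # the shift action of R^2 on itself

def read(head, keys, values, temperature=1.0):
    d2 = np.sum((keys - head) ** 2, axis=1)  # squared distances to each key
    w = np.exp(-d2 / temperature)            # smoothing weights (illustrative kernel)
    w /= w.sum()
    return w @ values                        # linear smoothing over stored values

head = shift(head, np.array([1.0, 0.0]))     # controller emits a shift toward [1, 0]
out = read(head, keys, values)               # read dominated by the value at [1, 0]
```

Because the read is a smooth function of the head position, the whole access mechanism is differentiable and can be trained end to end.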
no code implementations • 28 Feb 2016 • Greg Yang
We found the right configuration of LANTM to outperform the baseline in all of our experiments.
no code implementations • 12 Oct 2014 • Greg Yang
In addition, most of the results of this paper and of Kuyper and Terwijn do not apply to individual languages with a finite number of unary predicates.