Search Results for author: Greg Yang

Found 37 papers, 15 papers with code

A Spectral Condition for Feature Learning

no code implementations 26 Oct 2023 Greg Yang, James B. Simon, Jeremy Bernstein

The push to train ever larger neural networks has motivated the study of initialization and training at large network width.

Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

no code implementations 3 Oct 2023 Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou

By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones.
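
As a concrete illustration, here is a minimal sketch of widthwise hyperparameter transfer with $\mu$P, assuming the API of the `mup` package released alongside Tensor Programs V (`MuReadout`, `set_base_shapes`, `MuAdam`); the widths and learning rate are placeholders rather than values from the paper, and the exact imports should be checked against the package.

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

def make_mlp(width, d_in=784, d_out=10):
    return nn.Sequential(
        nn.Linear(d_in, width),
        nn.ReLU(),
        nn.Linear(width, width),
        nn.ReLU(),
        MuReadout(width, d_out),   # muP-aware output layer in place of nn.Linear
    )

base = make_mlp(width=64)     # narrow proxy on which hyperparameters are tuned
delta = make_mlp(width=128)   # second width, used to infer which dims scale with width
model = make_mlp(width=4096)  # wide target model

set_base_shapes(model, base, delta=delta)   # attach muP shape metadata to the wide model
opt = MuAdam(model.parameters(), lr=1e-3)   # lr found on the narrow proxy is reused as-is
```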

Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

no code implementations 3 Aug 2023 Greg Yang, Etai Littwin

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam?

Width and Depth Limits Commute in Residual Networks

no code implementations 1 Feb 2023 Soufiane Hayou, Greg Yang

We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), results in the same covariance structure no matter how that limit is taken.
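
A minimal PyTorch sketch of the scaling discussed above, with each residual branch multiplied by $1/\sqrt{depth}$; the width, depth, and block architecture are arbitrary choices for illustration, not the paper's setup.

```python
import math
import torch
import torch.nn as nn

class ScaledResBlock(nn.Module):
    def __init__(self, width, depth):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(width, width), nn.ReLU())
        self.scale = 1.0 / math.sqrt(depth)   # the 1/sqrt(depth) branch scaling

    def forward(self, x):
        return x + self.scale * self.branch(x)   # skip connection + scaled branch

depth, width = 64, 512
net = nn.Sequential(*[ScaledResBlock(width, depth) for _ in range(depth)])
y = net(torch.randn(8, width))   # covariance of y stays stable as width and depth grow
```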

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

no code implementations 3 May 2022 Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang

We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$.
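
The setting above can be written out directly; the sketch below builds the two-layer network, forms the empirical MSE loss, and takes a single gradient step on $\boldsymbol{W}$. The choice of ReLU for $\sigma$, the dimensions, and the step size are illustrative assumptions, not the paper's asymptotic regime.

```python
import torch

d, N, n = 32, 256, 512                      # input dim, width, sample size
W = torch.randn(d, N, requires_grad=True)   # first-layer weights (trained)
a = torch.randn(N)                          # second-layer weights (held fixed here)
X = torch.randn(n, d)
y = torch.randn(n)

def f(X):
    # f(x) = a^T sigma(W^T x) / sqrt(N), with sigma = ReLU for illustration
    return (torch.relu(X @ W) @ a) / N**0.5

loss = ((f(X) - y) ** 2).mean()   # empirical MSE: (1/n) sum_i (f(x_i) - y_i)^2
loss.backward()

eta = 1.0
with torch.no_grad():
    W -= eta * W.grad             # the single first-layer gradient step studied in the paper
```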

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

3 code implementations 7 Mar 2022 Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters.

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

1 code implementation 4 Nov 2021 Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao

We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.

Few-Shot Learning, Natural Language Understanding

Efficient Computation of Deep Nonlinear Infinite-Width Neural Networks that Learn Features

no code implementations ICLR 2022 Greg Yang, Michael Santacroce, Edward J Hu

While a popular infinite-width limit of neural networks, the Neural Tangent Kernel (NTK) often exhibits performance gaps relative to finite-width neural networks on standard datasets, due to its lack of feature learning.

3DB: A Framework for Debugging Computer Vision Models

1 code implementation 7 Jun 2021 Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang, Ashish Kapoor, Aleksander Madry

We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation.

Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics

no code implementations 8 May 2021 Greg Yang, Etai Littwin

To facilitate this proof, we develop a graphical notation for Tensor Programs.

Feature Learning in Infinite-Width Neural Networks

4 code implementations 30 Nov 2020 Greg Yang, Edward J. Hu

However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT.

Few-Shot Learning, Transfer Learning

Tensor Programs III: Neural Matrix Laws

no code implementations 22 Sep 2020 Greg Yang

FIP (the Free Independence Principle) and these results hold for any neural architecture.

Tensor Programs II: Neural Tangent Kernel for Any Architecture

2 code implementations 25 Jun 2020 Greg Yang

We prove that a randomly initialized neural network of *any architecture* has its Tangent Kernel (NTK) converge to a deterministic limit, as the network widths tend to infinity.
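
For intuition, the sketch below computes one entry of the finite-width empirical NTK of a small MLP as an inner product of parameter gradients; as the width grows, such entries concentrate around the deterministic limit the paper establishes. The architecture, activation, and width here are arbitrary.

```python
import torch
import torch.nn as nn

width = 2048
net = nn.Sequential(nn.Linear(10, width), nn.Tanh(), nn.Linear(width, 1))

def param_grad(x):
    # gradient of the scalar output with respect to all parameters, flattened
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(1, 10), torch.randn(1, 10)
k12 = param_grad(x1) @ param_grad(x2)   # empirical NTK entry K(x1, x2)
print(k12.item())
```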

Improved Image Wasserstein Attacks and Defenses

1 code implementation 26 Apr 2020 Edward J. Hu, Adith Swaminathan, Hadi Salman, Greg Yang

Robustness against image perturbations bounded by an $\ell_p$ ball has been well studied in the recent literature.

On Infinite-Width Hypernetworks

1 code implementation NeurIPS 2020 Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang

*Hypernetworks* are architectures that produce the weights of a task-specific *primary network*.

Meta-Learning
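
A minimal sketch of the idea: a hypernetwork maps a task embedding to the weights and bias of a single-layer primary network, which is then applied to the input. The shapes and the choice of a linear hypernetwork are illustrative, not the constructions analyzed in the paper.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, task_dim, d_in, d_out):
        super().__init__()
        # hypernetwork: maps a task embedding to the primary layer's parameters
        self.hyper = nn.Linear(task_dim, d_in * d_out + d_out)
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x, task_emb):
        params = self.hyper(task_emb)
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return x @ W.t() + b          # primary network: a single linear layer

layer = HyperLinear(task_dim=16, d_in=8, d_out=4)
out = layer(torch.randn(32, 8), torch.randn(16))   # one task embedding, a batch of inputs
```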

Randomized Smoothing of All Shapes and Sizes

1 code implementation ICML 2020 Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, Jerry Li

Randomized smoothing is the current state-of-the-art defense with provable robustness against $\ell_2$ adversarial attacks.
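
For reference, the sketch below shows the standard Gaussian ($\ell_2$) instance of the smoothed prediction rule: classify many noisy copies of the input with a base classifier and return the majority class. The paper studies smoothing distributions well beyond this case; the base classifier, `sigma`, and the sample count here are placeholders.

```python
import torch

def smoothed_predict(f, x, sigma=0.25, n_samples=1000):
    # g(x) ~ argmax_c P[f(x + eps) = c], eps ~ N(0, sigma^2 I), estimated by Monte Carlo
    noise = sigma * torch.randn(n_samples, *x.shape)
    logits = f(x.unsqueeze(0) + noise)          # f maps a batch of inputs to class logits
    votes = torch.bincount(logits.argmax(dim=1))
    return votes.argmax().item()                # most frequent class under noise

# toy base classifier for demonstration: a random linear map to 10 class logits
base = torch.nn.Linear(3 * 32 * 32, 10)
f = lambda batch: base(batch.flatten(1))
print(smoothed_predict(f, torch.randn(3, 32, 32)))
```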

Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

1 code implementation NeurIPS 2019 Greg Yang

Wide neural networks with random weights and biases are Gaussian processes, as observed by Neal (1995) for shallow networks, and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks.

Gaussian Processes
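
A quick empirical check of the statement, sketched below: sample many independently initialized wide MLPs, evaluate them at a fixed input, and the resulting outputs are approximately Gaussian. The width, depth, activation, and sample counts are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 20)                       # a fixed input
samples = []
for _ in range(1000):                        # 1000 independent random initializations
    net = nn.Sequential(nn.Linear(20, 2048), nn.Tanh(), nn.Linear(2048, 1))
    with torch.no_grad():
        samples.append(net(x).item())
samples = torch.tensor(samples)
print(samples.mean().item(), samples.std().item())   # a histogram of `samples` looks Gaussian
```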

Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

2 code implementations 28 Oct 2019 Greg Yang

Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks.

Gaussian Processes

The Dynamics of Signal Propagation in Gated Recurrent Neural Networks

no code implementations 25 Sep 2019 Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.

Free resolutions of function classes via order complexes

no code implementations 5 Sep 2019 Justin Chen, Christopher Eur, Greg Yang, Mengyuan Zhang

Function classes are collections of Boolean functions on a finite set, which are fundamental objects of study in theoretical computer science.

Learning Theory

A Fine-Grained Spectral Perspective on Neural Networks

1 code implementation 24 Jul 2019 Greg Yang, Hadi Salman

Are neural networks biased toward simple functions?

A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks

3 code implementations NeurIPS 2019 Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, Pengchuan Zhang

This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification.

A Mean Field Theory of Batch Normalization

no code implementations ICLR 2019 Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

no code implementations 13 Feb 2019 Greg Yang

Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks.

Gaussian Processes, Learning Theory

Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

no code implementations 25 Jan 2019 Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.

Deep Mean Field Theory: Layerwise Variance and Width Variation as Methods to Control Gradient Explosion

no code implementations ICLR 2018 Greg Yang, Sam S. Schoenholz

Using the obtained mean field theory, we are able to track surprisingly well how VV at initialization time affects training and test time performance on MNIST after a set number of epochs: the level sets of test/train set accuracies coincide with the level sets of the expectations of certain gradient norms or of metric expressivity (as defined in Yang (2017)), a measure of expansion in a random neural network.

Mean Field Residual Networks: On the Edge of Chaos

no code implementations NeurIPS 2017 Greg Yang, Samuel S. Schoenholz

Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on average when propagating inputs forward or gradients backward.
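
The sketch below illustrates this exponential behavior in the forward direction: the gap between two inputs changes exponentially with depth as they propagate through a deep tanh network (it shrinks under PyTorch's default initialization scale; other scales make it grow). Depth and width are arbitrary.

```python
import torch
import torch.nn as nn

depth, width = 100, 512
layers = [nn.Sequential(nn.Linear(width, width), nn.Tanh()) for _ in range(depth)]

x1, x2 = torch.randn(1, width), torch.randn(1, width)
with torch.no_grad():
    for i, layer in enumerate(layers, 1):
        x1, x2 = layer(x1), layer(x2)
        if i % 20 == 0:
            # the distance between the two propagated inputs changes exponentially in depth
            print(i, (x1 - x2).norm().item())
```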

A Homological Theory of Functions

no code implementations 9 Jan 2017 Greg Yang

In computational complexity, a complexity class is given by a set of problems or functions, and a basic challenge is to show separations of complexity classes $A \not= B$ especially when $A$ is known to be a subset of $B$.

LEMMA

Lie-Access Neural Turing Machines

no code implementations 9 Nov 2016 Greg Yang, Alexander M. Rush

The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and memory access is performed by linear smoothing in key space.
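
A rough sketch of the key-space read described above: read weights are obtained by smoothing over stored keys according to their distance from the current head position (an inverse-square kernel here). The kernel, shapes, and the absence of a controller are simplifications; the paper's exact weighting may differ.

```python
import torch

def key_space_read(memory, keys, head_pos, eps=1e-6):
    # memory: (n, d) stored values; keys: (n, k) key coordinates; head_pos: (k,)
    dist2 = ((keys - head_pos) ** 2).sum(dim=1)   # squared distance of each key to the head
    w = 1.0 / (dist2 + eps)                       # inverse-square smoothing weights
    w = w / w.sum()
    return w @ memory                             # weighted average of stored values

memory, keys = torch.randn(10, 32), torch.randn(10, 2)
head_pos = torch.randn(2)   # in LANTM the head position is moved by Lie group actions (e.g. shifts)
read_vec = key_space_read(memory, keys, head_pos)
```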

Lie Access Neural Turing Machine

no code implementations 28 Feb 2016 Greg Yang

We found the right configuration of LANTM to outperform the baseline in all of our experiments.

Computabilities of Validity and Satisfiability in Probability Logics over Finite and Countable Models

no code implementations 12 Oct 2014 Greg Yang

In addition, most of the results of this paper, and of Kuyper and Terwijn, do not apply to individual languages with a finite number of unary predicates.

Learning Theory
