# A Spectral Condition for Feature Learning

The push to train ever larger neural networks has motivated the study of initialization and training at large network width.

# Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting the optimal hyperparameters of wide neural networks from narrow ones.

# Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

no code implementations · 3 Aug 2023

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam?

# Width and Depth Limits Commute in Residual Networks

no code implementations · 1 Feb 2023

We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{\text{depth}}$ (the only nontrivial scaling), results in the same covariance structure no matter how that limit is taken.
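
As a quick illustration (a minimal numpy sketch with a toy tanh branch; all sizes and the branch form are assumptions, not the paper's construction), scaling each residual branch by $1/\sqrt{\text{depth}}$ keeps the hidden state's norm stable however large the depth grows:

```python
import numpy as np

def residual_forward(x, depth, width, rng):
    """Forward pass of a toy residual network whose branches are
    scaled by 1/sqrt(depth), the scaling under which the width and
    depth limits commute."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        h = h + np.tanh(W @ h) / np.sqrt(depth)  # branch scaled by 1/sqrt(depth)
    return h

rng = np.random.default_rng(0)
width = 256
x = rng.normal(size=width)
for depth in (4, 64, 512):
    h = residual_forward(x, depth, width, rng)
    print(depth, np.linalg.norm(h) / np.linalg.norm(x))  # stays O(1) in depth
```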

# High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$.
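
A minimal numpy sketch of this setup (the tanh activation, toy targets, and all sizes are illustrative assumptions; the paper's focus is how the step size must scale with dimension for the step to improve the representation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 32, 256, 1024            # input dim, width, number of samples

W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, N))  # illustrative init scale
a = rng.normal(size=N)
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])               # toy targets, for illustration only

sigma = np.tanh                    # any smooth activation
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def f(X, W):
    return sigma(X @ W) @ a / np.sqrt(N)   # (n,) network outputs

loss = lambda W: np.mean((f(X, W) - y) ** 2)

# Gradient of the empirical MSE loss with respect to W:
#   dL/dW = (2/n) * sum_i (f(x_i) - y_i) * x_i (a * sigma'(W^T x_i))^T
Z = X @ W
r = f(X, W) - y
grad_W = (2.0 / n) * X.T @ (r[:, None] * dsigma(Z) * a[None, :])

eta = 1.0                          # step size; the paper studies how it must scale
print(loss(W), loss(W - eta * grad_W))  # loss before and after the single step
```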

# Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters.

# CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.

# Efficient Computation of Deep Nonlinear Infinite-Width Neural Networks that Learn Features

Although it is a popular infinite-width limit of neural networks, the Neural Tangent Kernel (NTK) often exhibits performance gaps relative to finite-width networks on standard datasets, due to its lack of feature learning.

# Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

We analyze the learning dynamics of infinitely wide neural networks with a finite-sized bottleneck.

# 3DB: A Framework for Debugging Computer Vision Models

We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation.

# Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics

no code implementations · 8 May 2021

To facilitate this proof, we develop a graphical notation for Tensor Programs.

# Feature Learning in Infinite-Width Neural Networks

4 code implementations · 30 Nov 2020

However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT.

# Tensor Programs III: Neural Matrix Laws

no code implementations · 22 Sep 2020

The Free Independence Principle (FIP) and these results hold for any neural architecture.

# Tensor Programs II: Neural Tangent Kernel for Any Architecture

2 code implementations · 25 Jun 2020

We prove that a randomly initialized neural network of *any architecture* has its Neural Tangent Kernel (NTK) converge to a deterministic limit as the network widths tend to infinity.
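
The result covers arbitrary architectures; as a concrete special case under simple assumptions (a two-layer tanh network, written with closed-form parameter gradients so no autodiff library is needed), the empirical NTK $\langle \nabla_\theta f(x_1), \nabla_\theta f(x_2)\rangle$ can be computed directly and seen to concentrate as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 1 << 14                 # small input dim, large width

W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, N))
a = rng.normal(size=N)
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def ntk(x1, x2):
    """Empirical NTK <grad_theta f(x1), grad_theta f(x2)> of the two-layer
    net f(x) = a^T sigma(W^T x) / sqrt(N), from closed-form gradients."""
    z1, z2 = W.T @ x1, W.T @ x2
    term_a = sigma(z1) @ sigma(z2) / N                               # grads w.r.t. a
    term_W = (x1 @ x2) * ((a * dsigma(z1)) @ (a * dsigma(z2))) / N   # grads w.r.t. W
    return term_a + term_W

x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(ntk(x1, x2))   # concentrates around its deterministic limit as N grows
```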

# Improved Image Wasserstein Attacks and Defenses

Robustness against image perturbations bounded by an $\ell_p$ ball has been well studied in the recent literature.

# On Infinite-Width Hypernetworks

*Hypernetworks* are architectures that produce the weights of a task-specific *primary network*.

# Denoised Smoothing: A Provable Defense for Pretrained Classifiers

We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks.

# Randomized Smoothing of All Shapes and Sizes

Randomized smoothing is the current state-of-the-art defense with provable robustness against $\ell_2$ adversarial attacks.
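
For context, here is a minimal sketch of the standard randomized-smoothing prediction rule (not this paper's contribution; the toy base classifier and all constants are assumptions): the smoothed classifier takes a majority vote of the base classifier over Gaussian perturbations of the input, and the certified $\ell_2$ radius then follows from how confidently the top class wins:

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, rng=None):
    """Predict with the smoothed classifier
    g(x) = argmax_c P[ base_classifier(x + eps) = c ],  eps ~ N(0, sigma^2 I),
    estimated by Monte Carlo majority vote. (A certified radius additionally
    needs a confidence bound on the top-class probability; omitted here.)"""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    votes = np.array([base_classifier(x + eps) for eps in noise])
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]

# Toy base classifier on R^2, for illustration only.
base = lambda x: int(x[0] + x[1] > 0)
print(smoothed_predict(base, np.array([0.3, -0.1])))  # majority-vote label
```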

# Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

2 code implementations · 28 Oct 2019

Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks.
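
A quick empirical illustration of the claim in the simplest case (one hidden tanh layer with variance-$1/\text{fan-in}$ Gaussian weights; all sizes are assumptions): across random initializations, the outputs at a fixed pair of inputs are approximately jointly Gaussian with a deterministic covariance, the NNGP kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n_nets = 8, 4096, 3000

X = rng.normal(size=(2, d))            # two fixed inputs

# Outputs of many independent random one-hidden-layer tanh networks,
# evaluated jointly at both inputs.
outs = np.empty((n_nets, 2))
for k in range(n_nets):
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    a = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    outs[k] = np.tanh(X @ W.T) @ a

print(np.mean(outs, axis=0))  # approximately zero mean
print(np.cov(outs.T))         # approximately deterministic 2x2 NNGP covariance
```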

# Free resolutions of function classes via order complexes

Function classes are collections of Boolean functions on a finite set, which are fundamental objects of study in theoretical computer science.

# A Fine-Grained Spectral Perspective on Neural Networks

1 code implementation · 24 Jul 2019

Are neural networks biased toward simple functions?

# Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers

In this paper, we employ adversarial training to improve the performance of randomized smoothing.

# A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks

This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification.

# A Mean Field Theory of Batch Normalization

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.

# Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

no code implementations · 13 Feb 2019

Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process kernels to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks.

# NAIL: A General Interactive Fiction Agent

Interactive Fiction (IF) games are complex textual decision making problems.

# Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower.

# Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs).

# Deep Mean Field Theory: Layerwise Variance and Width Variation as Methods to Control Gradient Explosion

Using the obtained mean field theory, we can track surprisingly well how VV at initialization time affects training- and test-time performance on MNIST after a set number of epochs: the level sets of test/train accuracy coincide with the level sets of the expectations of certain gradient norms or of metric expressivity (as defined in Yang and Schoenholz (2017)), a measure of expansion in a random neural network.

# Mean Field Residual Networks: On the Edge of Chaos

Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward.
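
This exponential behavior is easy to see in simulation (a minimal sketch; the weight variances marking the two phases are illustrative assumptions): the distance between two hidden representations shrinks or grows geometrically with depth depending on the initialization scale:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 60

def pair_distances(sigma_w, x1, x2):
    """Propagate two nearby inputs through a random tanh feedforward net,
    tracking the distance between their hidden representations."""
    h1, h2, dists = x1, x2, []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        h1, h2 = np.tanh(W @ h1), np.tanh(W @ h2)
        dists.append(np.linalg.norm(h1 - h2))
    return dists

x1 = rng.normal(size=width)
x2 = x1 + 0.1 * rng.normal(size=width)
for sigma_w in (0.8, 1.5):   # ordered vs. chaotic initialization
    print(sigma_w, pair_distances(sigma_w, x1, x2)[::15])  # geometric decay/growth
```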

# A Homological Theory of Functions

no code implementations · 9 Jan 2017

In computational complexity, a complexity class is given by a set of problems or functions, and a basic challenge is to show separations of complexity classes, $A \neq B$, especially when $A$ is known to be a subset of $B$.

# Lie-Access Neural Turing Machines

no code implementations · 9 Nov 2016

The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and memory access is performed by linear smoothing in key space.
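
A minimal sketch of these two mechanisms (2-D keys, a rotation action, and an inverse-squared-distance smoothing kernel are simplifying assumptions, not necessarily the paper's exact choices):

```python
import numpy as np

def smoothed_read(head_key, mem_keys, mem_vals, eps=1e-6):
    """Linear smoothing in key space: read a convex combination of memory
    values weighted by inverse squared distance to the head key."""
    d2 = np.sum((mem_keys - head_key) ** 2, axis=1) + eps
    w = (1.0 / d2) / np.sum(1.0 / d2)
    return w @ mem_vals

def rotate(key, theta):
    """Move the head by a Lie group action: rotation of the 2-D key."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ key

mem_keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
mem_vals = np.eye(3)                           # toy stored values
head = np.array([1.0, 0.0])
print(smoothed_read(head, mem_keys, mem_vals)) # concentrated on slot 0
head = rotate(head, np.pi / 2)                 # controller action: rotate the head
print(smoothed_read(head, mem_keys, mem_vals)) # now concentrated on slot 1
```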

# Lie Access Neural Turing Machine

no code implementations · 28 Feb 2016

We found that the right configuration of LANTM outperformed the baseline in all of our experiments.

# Computabilities of Validity and Satisfiability in Probability Logics over Finite and Countable Models

no code implementations · 12 Oct 2014

In addition, most of the results, both of this paper and of Kuyper and Terwijn, do not apply to individual languages with a finite number of unary predicates.
