Search Results for author: James Martens

Found 21 papers, 10 papers with code

Disentangling the Causes of Plasticity Loss in Neural Networks

no code implementations29 Feb 2024 Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, Will Dabney

Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a \textit{stationary} data distribution.

Atari Games · Reinforcement Learning

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

no code implementations20 Feb 2023 Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood.

Pre-training via Denoising for Molecular Property Prediction

1 code implementation31 May 2022 Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter Battaglia, Razvan Pascanu, Jonathan Godwin

Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks.

Denoising · Molecular Property Prediction +1
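
As a rough sketch of the coordinate-denoising idea the title refers to (the encoder, noise scale, and loss below are illustrative assumptions, not the paper's actual setup; the linked code gives the real method): perturb the 3D atom positions with Gaussian noise and train the network to predict the noise that was added.

```python
import numpy as np

def denoising_pretrain_step(coords, encoder, sigma=0.1):
    """One hypothetical pre-training step: perturb 3D atom coordinates with
    Gaussian noise and measure how well the model recovers that noise.
    `encoder` is a placeholder for any model mapping coordinates to a
    per-atom 3D output; it is not the architecture used in the paper."""
    noise = sigma * np.random.randn(*coords.shape)   # (n_atoms, 3)
    pred = encoder(coords + noise)                   # model predicts the added noise
    return np.mean((pred - noise) ** 2)              # simple MSE denoising loss

# Toy usage with a trivial stand-in "encoder".
coords = np.random.randn(5, 3)
print(denoising_pretrain_step(coords, encoder=lambda x: np.zeros_like(x)))
```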

Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

1 code implementation ICLR 2022 Guodong Zhang, Aleksandar Botev, James Martens

However, this method (called Deep Kernel Shaping) isn't fully compatible with ReLUs, and produces networks that overfit significantly more than ResNets on ImageNet.

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

2 code implementations5 Oct 2021 James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function.
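
The Q/C map language in this abstract can be made concrete with a small Monte Carlo estimate. This is a sketch of the standard mean-field quantities rather than the paper's exact formalism: the Q map tracks how the second moment of pre-activations evolves through an activation function, and the C map tracks how the correlation between two inputs evolves.

```python
import numpy as np

def qc_map_mc(phi, q=1.0, c=0.5, n=200_000, rng=None):
    """Monte Carlo estimate of the Q and C maps for activation `phi`: draw
    correlated Gaussian pre-activations with second moment q and correlation c,
    then measure the second moment / correlation after applying phi."""
    rng = rng or np.random.default_rng(0)
    u1, u2 = rng.standard_normal((2, n))
    z1 = np.sqrt(q) * u1
    z2 = np.sqrt(q) * (c * u1 + np.sqrt(1 - c**2) * u2)
    a1, a2 = phi(z1), phi(z2)
    q_out = np.mean(a1**2)               # Q map: output second moment
    c_out = np.mean(a1 * a2) / q_out     # C map: output correlation
    return q_out, c_out

relu = lambda z: np.maximum(z, 0.0)
print(qc_map_mc(relu))   # iterating the ReLU C map pushes correlations toward 1 with depth
```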

On the validity of kernel approximations for orthogonally-initialized neural networks

no code implementations13 Apr 2021 James Martens

In this note we extend kernel function approximation results for neural networks with Gaussian-distributed weights to single-layer networks initialized using Haar-distributed random orthogonal matrices (with possible rescaling).
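
For reference, a Haar-distributed random orthogonal matrix of the kind studied in the note can be drawn via the QR decomposition of a Gaussian matrix with a sign correction (a standard construction; the optional rescaling mentioned in the abstract, e.g. by a gain factor, is omitted here).

```python
import numpy as np

def haar_orthogonal(n, rng=None):
    """Draw an n x n orthogonal matrix from the Haar measure via the QR
    decomposition of a Gaussian matrix, with the usual sign correction so
    the result is uniformly distributed over the orthogonal group."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    q *= np.sign(np.diag(r))    # multiply each column by the sign of R's diagonal
    return q

W = haar_orthogonal(4)
print(np.allclose(W @ W.T, np.eye(4)))   # orthogonality check
```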

Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

no code implementations NeurIPS 2019 Guodong Zhang, James Martens, Roger B. Grosse

For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized.
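
A minimal sketch of the natural gradient update being analyzed, for a two-layer ReLU network with squared-error loss. For this loss the Fisher coincides with the Gauss-Newton matrix J^T J / n; the finite-difference Jacobian, damping, step size, and network size below are toy-problem conveniences, not part of the paper's analysis.

```python
import numpy as np

def ngd_step(theta, f, X, y, lr=1.0, damping=1e-2, eps=1e-5):
    """One damped natural-gradient step for a squared-error model. The Fisher
    is built as J^T J / n from a finite-difference Jacobian (only sensible
    for tiny toy models)."""
    n = len(y)
    preds = f(theta, X)
    J = np.zeros((n, theta.size))
    for j in range(theta.size):            # numerical Jacobian of outputs w.r.t. params
        t = theta.copy()
        t[j] += eps
        J[:, j] = (f(t, X) - preds) / eps
    grad = J.T @ (preds - y) / n           # gradient of 0.5 * mean squared error
    F = J.T @ J / n + damping * np.eye(theta.size)
    return theta - lr * np.linalg.solve(F, grad)

# Tiny two-layer ReLU network with one hidden layer of width m.
d, m = 3, 8
def f(theta, X):
    W1, w2 = theta[:d * m].reshape(m, d), theta[d * m:]
    return np.maximum(X @ W1.T, 0.0) @ w2

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, d)), rng.standard_normal(32)
theta = 0.5 * rng.standard_normal(d * m + m)
for _ in range(20):
    theta = ngd_step(theta, f, X, y)
print(np.mean((f(theta, X) - y) ** 2))     # training loss falls well below its initial value
```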

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

1 code implementation NeurIPS 2019 Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns.
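
A toy simulation in the spirit of the noisy quadratic model (the curvatures, noise scales, and learning-rate grid below are made-up illustrative choices): gradient noise shrinks as 1/batch_size, and with the learning rate tuned per batch size the steps-to-target at first drop roughly in proportion to the batch size, then flatten out.

```python
import numpy as np

def steps_to_target(h, c, batch_size, lr, target, max_steps, rng):
    """SGD on the diagonal noisy quadratic 0.5 * sum(h * x^2): the stochastic
    gradient is h*x plus Gaussian noise with variance c / batch_size.
    Returns the number of steps needed to reach `target` loss."""
    x = np.ones_like(h)
    for step in range(1, max_steps + 1):
        g = h * x + rng.standard_normal(h.size) * np.sqrt(c / batch_size)
        x = x - lr * g
        if 0.5 * np.sum(h * x ** 2) < target:
            return step
    return max_steps

h = np.array([1.0, 0.3, 0.1])        # per-coordinate curvatures (illustrative)
c = 0.1 * np.ones_like(h)            # per-coordinate gradient-noise scales
for B in [1, 4, 16, 64, 256, 1024]:
    best = min(steps_to_target(h, c, B, lr, target=1e-2, max_steps=20_000,
                               rng=np.random.default_rng(0))
               for lr in [0.01, 0.03, 0.1, 0.3, 1.0])
    print(B, best)
# Steps-to-target shrinks roughly linearly with batch size at first, then
# plateaus past a critical batch size: larger batches stop buying speedups.
```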

Adversarial Robustness through Local Linearization

no code implementations NeurIPS 2019 Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

Using this regularizer, we exceed current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack.

Adversarial Defense · Adversarial Robustness
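
A simplified reading of the local-linearity quantity this regularizer is built on: how far the loss deviates from its first-order Taylor expansion inside the perturbation ball. The paper maximizes this gap with an inner optimization and combines it with an additional term; the sampling-based estimate and the toy loss below are assumptions made only for illustration.

```python
import numpy as np

def local_linearity_violation(loss, grad, x, eps, n_samples=256, rng=None):
    """Estimate how far `loss` deviates from its first-order Taylor expansion
    inside an l-infinity ball of radius eps around x (sampled rather than
    maximized, unlike the paper's inner optimization)."""
    rng = rng or np.random.default_rng(0)
    l0, g0 = loss(x), grad(x)
    deltas = rng.uniform(-eps, eps, size=(n_samples, x.size))
    return max(abs(loss(x + d) - l0 - d @ g0) for d in deltas)

# Toy nonlinear "loss": a large violation means the loss surface is far from
# linear locally, which the paper links to gradient masking and weaker robustness.
loss = lambda x: np.tanh(x).sum() ** 2
grad = lambda x: 2 * np.tanh(x).sum() * (1 - np.tanh(x) ** 2)
x = np.linspace(-1.0, 1.0, 10)
print(local_linearity_violation(loss, grad, x, eps=4 / 255))
```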

Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

no code implementations27 May 2019 Guodong Zhang, James Martens, Roger Grosse

In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss.

Differentiable Game Mechanics

1 code implementation13 May 2019 Alistair Letcher, David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel

The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games.
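
A minimal sketch of Symplectic Gradient Adjustment on a two-player zero-sum bilinear game, where plain simultaneous gradient descent spirals away from the fixed point. The finite-difference Hessian and the fixed sign of lambda are simplifications (the paper adds an alignment criterion for choosing the sign).

```python
import numpy as np

def sga_step(w, xi_fn, lr=0.05, lam=1.0, eps=1e-5):
    """One SGA step: follow the simultaneous gradient xi plus lam * A^T xi,
    where A is the antisymmetric ("Hamiltonian") part of the game Hessian,
    estimated here by finite differences."""
    xi = xi_fn(w)
    n = w.size
    H = np.zeros((n, n))
    for j in range(n):                      # H[:, j] = d xi / d w_j
        e = np.zeros(n)
        e[j] = eps
        H[:, j] = (xi_fn(w + e) - xi) / eps
    A = 0.5 * (H - H.T)
    return w - lr * (xi + lam * A.T @ xi)

# Zero-sum bilinear game: player 1 minimizes x*y over x, player 2 minimizes -x*y over y.
xi_fn = lambda w: np.array([w[1], -w[0]])   # simultaneous gradient (dL1/dx, dL2/dy)

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sga_step(w, xi_fn)
print(w)   # converges toward the stable fixed point (0, 0)
```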

On the Variance of Unbiased Online Recurrent Optimization

no code implementations6 Feb 2019 Tim Cooijmans, James Martens

The recently proposed Unbiased Online Recurrent Optimization algorithm (UORO, arXiv:1702.05043) uses an unbiased approximation of RTRL to achieve fully online gradient-based learning in RNNs.

The Mechanics of n-Player Differentiable Games

1 code implementation ICML 2018 David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel

The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.

Kronecker-factored Curvature Approximations for Recurrent Neural Networks

no code implementations ICLR 2018 James Martens, Jimmy Ba, Matt Johnson

Kronecker-factored Approximate Curvature (K-FAC; Martens & Grosse, 2015) is a 2nd-order optimization method which has been shown to give state-of-the-art performance on large-scale neural network optimization tasks (Ba et al., 2017).
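
For context, this is the basic Kronecker factorization K-FAC starts from, shown for a single dense layer (the recurrent case handled in this paper needs further approximations): the layer's Fisher block E[(a ⊗ g)(a ⊗ g)^T] is approximated by E[a a^T] ⊗ E[g g^T], which is exact when activations and backpropagated gradients are independent.

```python
import numpy as np

def kfac_block(A, G):
    """Kronecker factors for one dense layer's Fisher block: input activations
    A (n x d_in) and backpropagated output gradients G (n x d_out) give
    E[a a^T] and E[g g^T] as sample averages."""
    return A.T @ A / len(A), G.T @ G / len(G)

rng = np.random.default_rng(0)
A, G = rng.standard_normal((256, 5)), rng.standard_normal((256, 3))
A_f, G_f = kfac_block(A, G)

approx = np.kron(A_f, G_f)   # K-FAC's factored approximation of the block
exact = np.mean([np.outer(np.kron(a, g), np.kron(a, g)) for a, g in zip(A, G)], axis=0)
# Small on this synthetic, independent data; real networks rely on the
# independence approximation between activations and gradients.
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```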

A Kronecker-factored approximate Fisher matrix for convolution layers

2 code implementations3 Feb 2016 Roger Grosse, James Martens

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function.

Adding Gradient Noise Improves Learning for Very Deep Networks

4 code implementations21 Nov 2015 Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, James Martens

This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks.

Question Answering
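
The method itself is a one-line change to SGD: add zero-mean Gaussian noise with decaying variance to each gradient before the update. The schedule form below matches the annealed noise described in the paper; the specific constants and the surrounding training loop (`compute_gradient`, `loader`, etc.) are placeholders.

```python
import numpy as np

def noisy_gradient(grad, step, eta=0.3, gamma=0.55, rng=None):
    """Add annealed Gaussian noise to a gradient before the parameter update,
    with variance eta / (1 + t)^gamma decaying over training."""
    rng = rng or np.random.default_rng(0)
    sigma2 = eta / (1.0 + step) ** gamma
    return grad + rng.standard_normal(grad.shape) * np.sqrt(sigma2)

# Hypothetical usage inside a plain SGD loop (model and data are placeholders):
# for t, (x, y) in enumerate(loader):
#     g = compute_gradient(params, x, y)
#     params -= lr * noisy_gradient(g, t)
```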

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

16 code implementations19 Mar 2015 James Martens, Roger Grosse

This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

Stochastic Optimization
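
The cost claim in this sentence comes down to a property of Kronecker products: the inverse of A ⊗ G is A^{-1} ⊗ G^{-1}, so only the two small factors ever need to be stored and inverted, and their size depends on layer dimensions rather than on the amount of data. The random stand-in factors below (not statistics from a real network) check this numerically.

```python
import numpy as np

def spd(n, rng):
    """Random symmetric positive-definite matrix (stand-in for a K-FAC factor)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
A, G = spd(d_in, rng), spd(d_out, rng)      # stand-ins for E[a a^T] and E[g g^T]

# Inverting the Kronecker product only requires inverting the two small factors.
lhs = np.linalg.inv(np.kron(A, G))
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))
print(np.allclose(lhs, rhs))                # True

# The preconditioned gradient can therefore be applied in matrix form,
# G^{-1} D A^{-1}, without materializing the (d_in*d_out)^2 matrix.
D = rng.standard_normal((d_out, d_in))      # gradient for a d_out x d_in weight matrix
via_kron = (rhs @ D.flatten(order="F")).reshape(d_out, d_in, order="F")
via_factors = np.linalg.inv(G) @ D @ np.linalg.inv(A)
print(np.allclose(via_kron, via_factors))   # True
```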

New insights and perspectives on the natural gradient method

2 code implementations3 Dec 2014 James Martens

Additionally, we make the following contributions to the understanding of natural gradient and 2nd-order methods: a thorough analysis of the convergence speed of stochastic natural gradient descent (and more general stochastic 2nd-order methods) as applied to convex quadratics, a critical examination of the oft-used "empirical" approximation of the Fisher matrix, and an analysis of the (approximate) parameterization invariance property possessed by natural gradient methods, which we show still holds for certain choices of the curvature matrix other than the Fisher, but notably not the Hessian.
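
One of the points examined here, the "empirical" Fisher approximation, is easy to illustrate for binary logistic regression (a toy stand-in chosen for this sketch): the true Fisher takes the expectation over labels drawn from the model's own predictive distribution, while the empirical Fisher plugs in the observed training labels, and the two can differ substantially away from the optimum.

```python
import numpy as np

def fishers_logistic(w, X, y):
    """True vs. "empirical" Fisher for binary logistic regression. The true
    Fisher averages p(1-p) x x^T (labels drawn from the model); the empirical
    Fisher averages (y - p)^2 x x^T using the observed labels."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    true_F = (X * (p * (1 - p))[:, None]).T @ X / len(X)
    emp_F = (X * ((y - p) ** 2)[:, None]).T @ X / len(X)
    return true_F, emp_F

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = (rng.random(500) < 0.5).astype(float)    # labels unrelated to the model
w = np.array([2.0, -1.0, 0.5])               # parameters far from the optimum
true_F, emp_F = fishers_logistic(w, X, y)
print(np.round(true_F, 3))
print(np.round(emp_F, 3))                    # noticeably different matrices
```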

On the Expressive Efficiency of Sum Product Networks

no code implementations27 Nov 2014 James Martens, Venkatesh Medabalimi

In this work we analyze the D&C conditions, expose the various connections that D&C SPNs have with multilinear arithmetic circuits, and consider the question of how well they can capture various distributions as a function of their size and depth.

On the Representational Efficiency of Restricted Boltzmann Machines

no code implementations NeurIPS 2013 James Martens, Arkadev Chattopadhyay, Toni Pitassi, Richard Zemel

This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)?

On the importance of initialization and momentum in deep learning

no code implementations ICML 2013 Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum.

Second-order methods
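
The classical and Nesterov momentum updates discussed in the paper, written in the velocity form it uses; the toy quadratic and hyperparameters below are illustrative choices only.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.005, mu=0.9, nesterov=True):
    """One update of classical vs. Nesterov momentum: Nesterov evaluates the
    gradient at the look-ahead point theta + mu * v rather than at theta."""
    lookahead = theta + mu * v if nesterov else theta
    v = mu * v - lr * grad_fn(lookahead)
    return theta + v, v

# Toy ill-conditioned quadratic where momentum makes a visible difference.
H = np.diag([100.0, 1.0])
grad_fn = lambda th: H @ th

theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad_fn)
print(theta)   # approaches the minimum at the origin; Nesterov tolerates larger mu more gracefully
```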
