Search Results for author: Roger Grosse

Found 63 papers, 44 papers with code

Training Data Attribution via Approximate Unrolled Differentiation

1 code implementation • 20 May 2024 • Juhan Bae, Wu Lin, Jonathan Lorraine, Roger Grosse

While computationally efficient compared to unrolling-based approaches, Source also remains applicable in cases where implicit-differentiation-based approaches struggle, such as non-converged models and multi-stage training pipelines.


Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo

1 code implementation • 26 Apr 2024 • Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger Grosse

Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence.

Language Modelling • Prompt Engineering
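As a cartoon of the sampling problem described above, here is a minimal sequential Monte Carlo loop over token sequences. The uniform next-token proposal and the hand-picked twist function (which upweights prefixes containing token 0) are illustrative stand-ins for a base language model and the learned twists, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, N = 4, 5, 1000

def twist(seqs, t):
    # Hypothetical twist: upweight prefixes containing more of token 0.
    # The paper learns twist functions; this one is hand-picked.
    return np.exp((seqs[:, : t + 1] == 0).sum(axis=1).astype(float))

seqs = np.zeros((N, LENGTH), dtype=int)
prev = np.ones(N)
for t in range(LENGTH):
    # Proposal: a stand-in for sampling a next token from a base model.
    seqs[:, t] = rng.integers(VOCAB, size=N)
    # Incremental importance weight is the ratio of successive twists.
    cur = twist(seqs, t)
    w = cur / prev
    w /= w.sum()
    # Multinomial resampling concentrates particles on high-potential prefixes.
    idx = rng.choice(N, size=N, p=w)
    seqs, prev = seqs[idx], cur[idx]

frac_zero = (seqs == 0).mean()   # enriched above the uniform rate of 0.25
```

In the RLHF-style settings the abstract mentions, a reward or potential over the full sequence would play the role this toy twist plays here.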

REFACTOR: Learning to Extract Theorems from Proofs

1 code implementation • 26 Feb 2024 • Jin Peng Zhou, Yuhuai Wu, Qiyang Li, Roger Grosse

With newly extracted theorems, we show that the existing proofs in the MetaMath database can be refactored.

Automated Theorem Proving

Studying Large Language Model Generalization with Influence Functions

2 code implementations • 7 Aug 2023 • Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior?

counterfactual • Language Modelling +2

Improving Mutual Information Estimation with Annealed and Energy-Based Bounds

1 code implementation • ICLR 2022 • Rob Brekelmans, Sicong Huang, Marzyeh Ghassemi, Greg Ver Steeg, Roger Grosse, Alireza Makhzani

Since accurate estimation of MI without density information requires a sample size exponential in the true MI, we assume either a single marginal or the full joint density information is known.

Mutual Information Estimation

Efficient Parametric Approximations of Neural Network Function Space Distance

no code implementations • 7 Feb 2023 • Nikita Dhawan, Sicong Huang, Juhan Bae, Roger Grosse

It is often useful to compactly summarize important properties of model parameters and training data so that they can be used later without storing and/or iterating over the entire dataset.

Continual Learning

Multi-Rate VAE: Train Once, Get the Full Rate-Distortion Curve

no code implementations • 7 Dec 2022 • Juhan Bae, Michael R. Zhang, Michael Ruan, Eric Wang, So Hasegawa, Jimmy Ba, Roger Grosse

Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications.

Path Independent Equilibrium Models Can Better Exploit Test-Time Computation

no code implementations • 18 Nov 2022 • Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, Zico Kolter, Roger Grosse

Designing networks capable of attaining better performance with an increased inference budget is important to facilitate generalization to harder problem instances.

Toy Models of Superposition

1 code implementation • 21 Sep 2022 • Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging.

If Influence Functions are the Answer, Then What is the Question?

2 code implementations • 12 Sep 2022 • Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, Roger Grosse

Influence functions efficiently estimate the effect of removing a single training data point on a model's learned parameters.
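The leave-one-out effect that influence functions approximate can be checked on ridge regression, where the loss is quadratic and the inverse-Hessian-vector-product estimate nearly matches actual retraining. A minimal numpy sketch with arbitrary illustrative data, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 10.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def fit(X, y):
    # Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta = fit(X, y)
H = 2 * (X.T @ X + lam * np.eye(d))      # Hessian of the training objective
i = 3                                    # index of the point to "remove"
g_i = 2 * X[i] * (X[i] @ theta - y[i])   # gradient of point i's loss
pred_change = np.linalg.solve(H, g_i)    # influence-function estimate
actual_change = fit(np.delete(X, i, 0), np.delete(y, i, 0)) - theta
```

The residual gap between `pred_change` and `actual_change` comes from the rank-one change in the Hessian when the point is removed, which the first-order estimate ignores.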

Amortized Proximal Optimization

no code implementations • 28 Feb 2022 • Juhan Bae, Paul Vicol, Jeff Z. HaoChen, Roger Grosse

Using APO to adapt a structured preconditioning matrix generally results in optimization performance competitive with second-order methods.

Image Classification • Image Reconstruction +2

Learning to Give Checkable Answers with Prover-Verifier Games

no code implementations • 27 Aug 2021 • Cem Anil, Guodong Zhang, Yuhuai Wu, Roger Grosse

We develop instantiations of the PVG for two algorithmic tasks, and show that in practice, the verifier learns a robust decision rule that is able to receive useful and reliable information from an untrusted prover.

Differentiable Annealed Importance Sampling and the Perils of Gradient Noise

no code implementations • NeurIPS 2021 • Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, Roger Grosse

To this end, we propose Differentiable AIS (DAIS), a variant of AIS which ensures differentiability by abandoning the Metropolis-Hastings corrections.

Stochastic Optimization

Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition

2 code implementations • 10 Jun 2021 • Shengyang Sun, Jiaxin Shi, Andrew Gordon Wilson, Roger Grosse

We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.

Gaussian Processes • regression

Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes

1 code implementation • 22 Apr 2021 • James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse

Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective.
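A toy version of this interpolation experiment is easy to run with logistic regression, where the loss is convex and the monotone decrease along the line is expected; the paper studies when and why the same pattern holds for deep networks. The data and hyperparameters below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def loss(w):
    p = 1 / (1 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

w0 = rng.normal(size=2)              # random initialization
w = w0.copy()
for _ in range(500):                 # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / len(y)

# Evaluate the training loss along the straight line from init to solution.
alphas = np.linspace(0.0, 1.0, 21)
path = [loss(w0 + a * (w - w0)) for a in alphas]
monotone = all(a >= b - 1e-9 for a, b in zip(path, path[1:]))
```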

LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning

1 code implementation • 15 Jan 2021 • Yuhuai Wu, Markus Rabe, Wenda Li, Jimmy Ba, Roger Grosse, Christian Szegedy

While designing inductive bias in neural architectures has been widely studied, we hypothesize that transformer networks are flexible enough to learn inductive bias from suitable generic tasks.

Inductive Bias • Mathematical Reasoning

Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations?

1 code implementation • 6 Nov 2020 • Chaoqi Wang, Shengyang Sun, Roger Grosse

While uncertainty estimation is a well-studied topic in deep learning, most such work focuses on marginal uncertainty estimates, i.e. the predictive mean and variance at individual input locations.

Active Learning • Benchmarking +1

A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints

1 code implementation • 23 Sep 2020 • Guodong Zhang, Xuchan Bao, Laurent Lessard, Roger Grosse

The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements.

Evaluating Lossy Compression Rates of Deep Generative Models

2 code implementations • ICML 2020 • Sicong Huang, Alireza Makhzani, Yanshuai Cao, Roger Grosse

The field of deep generative modeling has succeeded in producing astonishingly realistic-seeming images and audio, but quantitative evaluation remains a challenge.

Regularized linear autoencoders recover the principal components, eventually

1 code implementation • NeurIPS 2020 • Xuchan Bao, James Lucas, Sushant Sachdeva, Roger Grosse

Our understanding of learning input-output relationships with neural nets has improved rapidly in recent years, but little is known about the convergence of the underlying representations, even in the simple case of linear autoencoders (LAEs).

The Scattering Compositional Learner: Discovering Objects, Attributes, Relationships in Analogical Reasoning

3 code implementations • 8 Jul 2020 • Yuhuai Wu, Honghua Dong, Roger Grosse, Jimmy Ba

In this work, we focus on an analogical reasoning task that contains rich compositional structures, Raven's Progressive Matrices (RPM).

Zero-shot Generalization

Learning Branching Heuristics for Propositional Model Counting

no code implementations • 7 Jul 2020 • Pashootan Vaezipoor, Gil Lederman, Yuhuai Wu, Chris J. Maddison, Roger Grosse, Sanjit A. Seshia, Fahiem Bacchus

In addition to step count improvements, Neuro# can also achieve orders of magnitude wall-clock speedups over the vanilla solver on larger instances in some problem families, despite the runtime overhead of querying the model.

INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

1 code implementation • ICLR 2021 • Yuhuai Wu, Albert Qiaochu Jiang, Jimmy Ba, Roger Grosse

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time.

Automated Theorem Proving

When Does Preconditioning Help or Hurt Generalization?

no code implementations • ICLR 2021 • Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu

While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question.

regression • Second-order methods

Understanding and Mitigating Exploding Inverses in Invertible Neural Networks

1 code implementation • 16 Jun 2020 • Jens Behrmann, Paul Vicol, Kuan-Chieh Wang, Roger Grosse, Jörn-Henrik Jacobsen

For problems where global invertibility is necessary, such as applying normalizing flows on OOD data, we show the importance of designing stable INN building blocks.

Picking Winning Tickets Before Training by Preserving Gradient Flow

3 code implementations • ICLR 2020 • Chaoqi Wang, Guodong Zhang, Roger Grosse

Overparameterization has been shown to benefit both the optimization and generalization of neural networks, but large networks are resource hungry at both training and test time.

Network Pruning

Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse

no code implementations • NeurIPS 2019 • James Lucas, George Tucker, Roger Grosse, Mohammad Norouzi

Posterior collapse in Variational Autoencoders (VAEs) arises when the variational posterior distribution closely matches the prior for a subset of latent variables.

Variational Inference

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

1 code implementation • NeurIPS 2019 • Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns.

Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

no code implementations • 27 May 2019 • Guodong Zhang, James Martens, Roger Grosse

In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss.

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

1 code implementation • 15 May 2019 • Chaoqi Wang, Roger Grosse, Sanja Fidler, Guodong Zhang

Reducing the test time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices.

Network Pruning

Online Hyperparameter Adaptation via Amortized Proximal Optimization

no code implementations • ICLR 2019 • Paul Vicol, Jeffery Z. HaoChen, Roger Grosse

Effective performance of neural networks depends critically on effective tuning of optimization hyperparameters, especially learning rates (and schedules thereof).

Understanding Posterior Collapse in Generative Latent Variable Models

no code implementations • ICLR Workshop DeepGenStruct 2019 • James Lucas, George Tucker, Roger Grosse, Mohammad Norouzi

Posterior collapse in Variational Autoencoders (VAEs) arises when the variational distribution closely matches the uninformative prior for a subset of latent variables.

Variational Inference

Functional Variational Bayesian Neural Networks

3 code implementations • ICLR 2019 • Shengyang Sun, Guodong Zhang, Jiaxin Shi, Roger Grosse

We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower BOund (ELBO) defined directly on stochastic processes, i.e. distributions over functions.

Bayesian Inference • Gaussian Processes +1

Eigenvalue Corrected Noisy Natural Gradient

3 code implementations • 30 Nov 2018 • Juhan Bae, Guodong Zhang, Roger Grosse

A recently proposed method, noisy natural gradient, is a surprisingly simple method to fit expressive posteriors by adding weight noise to regular natural gradient updates.

Sorting out Lipschitz function approximation

1 code implementation • 13 Nov 2018 • Cem Anil, James Lucas, Roger Grosse

We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation.

Adversarial Robustness • Generalization Bounds
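The gradient-norm-preservation property mentioned in the abstract can be checked directly for the two ingredients this line of work combines, a norm-preserving (here orthogonal) linear layer and a GroupSort-style sorting activation; a minimal numpy sanity check, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def groupsort(x, group=2):
    # Sorting within groups is a (data-dependent) permutation of the
    # coordinates, so it preserves Euclidean norms exactly, for forward
    # activations and backpropagated gradients alike.
    return np.sort(x.reshape(-1, group), axis=1).reshape(-1)

# An orthogonal weight matrix likewise preserves norms: ||Qx|| = ||x||.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

x = rng.normal(size=8)
h = groupsort(Q @ x)
norm_in, norm_out = np.linalg.norm(x), np.linalg.norm(h)
```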

Three Mechanisms of Weight Decay Regularization

no code implementations • ICLR 2019 • Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization.
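Part of why the $L_2$ interpretation is subtle: for plain SGD, an explicit $L_2$ penalty and decoupled weight decay produce identical updates, and the two only come apart under adaptive or preconditioned optimizers. A one-step check (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)
grad = rng.normal(size=4)    # stand-in for a minibatch loss gradient
lr, lam = 0.1, 0.01

# SGD on the loss plus an explicit L2 penalty (lam/2) * ||w||^2:
w_l2 = w - lr * (grad + lam * w)
# SGD with decoupled weight decay: shrink weights, then step on the loss alone:
w_decoupled = (1 - lr * lam) * w - lr * grad
```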

Reversible Recurrent Neural Networks

1 code implementation • NeurIPS 2018 • Matthew MacKay, Paul Vicol, Jimmy Ba, Roger Grosse

Reversible RNNs---RNNs for which the hidden-to-hidden transition can be reversed---offer a path to reduce the memory requirements of training, as hidden states need not be stored and instead can be recomputed during backpropagation.


A Coordinate-Free Construction of Scalable Natural Gradient

no code implementations • 30 Aug 2018 • Kevin Luk, Roger Grosse

Most neural networks are trained using first-order optimization methods, which are sensitive to the parameterization of the model.

Distilling the Posterior in Bayesian Neural Networks

no code implementations • ICML 2018 • Kuan-Chieh Wang, Paul Vicol, James Lucas, Li Gu, Roger Grosse, Richard Zemel

We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using a Generative Adversarial Network (GAN).

Active Learning • Anomaly Detection +1

Adversarial Distillation of Bayesian Neural Network Posteriors

1 code implementation • 27 Jun 2018 • Kuan-Chieh Wang, Paul Vicol, James Lucas, Li Gu, Roger Grosse, Richard Zemel

We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using a Generative Adversarial Network (GAN).

Active Learning • Anomaly Detection +1

Aggregated Momentum: Stability Through Passive Damping

1 code implementation • ICLR 2019 • James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse

Momentum is a simple and widely used trick which allows gradient-based optimizers to pick up speed along low curvature directions.
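As background for the aggregated variant (which maintains several velocity buffers with different damping coefficients), the basic speed-up that momentum provides along low-curvature directions can be seen on an ill-conditioned quadratic; a minimal heavy-ball sketch with illustrative hyperparameters, not the paper's AggMo optimizer:

```python
import numpy as np

A = np.diag([1.0, 50.0])      # ill-conditioned quadratic: f(x) = 0.5 x^T A x
lr, beta, steps = 0.03, 0.9, 300
x_gd = np.array([1.0, 1.0])
x_mom = np.array([1.0, 1.0])
v = np.zeros(2)

for _ in range(steps):
    x_gd = x_gd - lr * (A @ x_gd)       # plain gradient descent
    v = beta * v - lr * (A @ x_mom)     # heavy-ball velocity accumulates
    x_mom = x_mom + v                   # speed along low-curvature directions

err_gd, err_mom = np.linalg.norm(x_gd), np.linalg.norm(x_mom)
```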

Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

3 code implementations • ICLR 2018 • Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse

Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies.

Understanding Short-Horizon Bias in Stochastic Meta-Optimization

1 code implementation • ICLR 2018 • Yuhuai Wu, Mengye Ren, Renjie Liao, Roger Grosse

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training.

Isolating Sources of Disentanglement in Variational Autoencoders

10 code implementations • NeurIPS 2018 • Ricky T. Q. Chen, Xuechen Li, Roger Grosse, David Duvenaud

We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables.
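The decomposition referred to here, with $n$ indexing datapoints and $z_j$ the latent dimensions, splits the aggregate KL term of the ELBO into three parts; a sketch of the identity, up to notation:

```latex
\mathbb{E}_{p(n)}\!\left[\mathrm{KL}\big(q(z \mid n)\,\|\,p(z)\big)\right]
= \underbrace{I_q(z; n)}_{\text{index-code MI}}
+ \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation}}
+ \underbrace{\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

The middle term is the total correlation the abstract refers to; here $q(z) = \mathbb{E}_{p(n)}[q(z \mid n)]$ is the aggregate posterior.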


Noisy Natural Gradient as Variational Inference

2 code implementations • ICML 2018 • Guodong Zhang, Shengyang Sun, David Duvenaud, Roger Grosse

Variational Bayesian neural nets combine the flexibility of deep learning with Bayesian uncertainty estimation.

Active Learning • Efficient Exploration +2

Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

8 code implementations • NeurIPS 2017 • Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, Jimmy Ba

In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature.

Atari Games • Continuous Control +2

On the Quantitative Analysis of Decoder-Based Generative Models

2 code implementations • 14 Nov 2016 • Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, Roger Grosse

The past several years have seen remarkable progress in generative models which produce convincing samples of images and other modalities.


A Kronecker-factored approximate Fisher matrix for convolution layers

2 code implementations • 3 Feb 2016 • Roger Grosse, James Martens

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function.

Learning Wake-Sleep Recurrent Attention Models

no code implementations • NeurIPS 2015 • Jimmy Ba, Roger Grosse, Ruslan Salakhutdinov, Brendan Frey

Despite their success, convolutional neural networks are computationally expensive because they must examine all image locations.

Caption Generation • Computational Efficiency +2

Statistical Inference, Learning and Models in Big Data

no code implementations • 9 Sep 2015 • Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dieter Hendricks, Nancy Reid

The need for new methods to deal with big data is a common theme in most scientific fields, although its definition tends to vary with the context.

Importance Weighted Autoencoders

23 code implementations • 1 Sep 2015 • Yuri Burda, Roger Grosse, Ruslan Salakhutdinov

The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference.

Density Estimation
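The importance-weighted bound that gives the paper its name can be demonstrated on a toy Gaussian model where the true marginal likelihood is available in closed form; the model, proposal, and sample sizes below are illustrative choices, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.5                                    # one observation
# Toy model: z ~ N(0,1), x|z ~ N(z,1), so marginally x ~ N(0,2).
true_logpx = -0.5 * (np.log(2 * np.pi * 2) + x**2 / 2)

def log_w(z):
    # log p(x, z) - log q(z), with a deliberately crude proposal q = N(0,1).
    log_p = -0.5 * (z**2 + (x - z) ** 2) - np.log(2 * np.pi)
    log_q = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    return log_p - log_q

def iwae_bound(k, n_rep=2000):
    # Average of log-mean-exp over k importance samples (k=1 is the ELBO).
    lw = log_w(rng.normal(size=(n_rep, k)))
    m = lw.max(axis=1, keepdims=True)
    return float(np.mean(m.squeeze() + np.log(np.mean(np.exp(lw - m), axis=1))))

l1, l64 = iwae_bound(1), iwae_bound(64)    # the bound tightens as k grows
```

With more importance samples the bound moves from the ELBO toward the true log marginal likelihood while never exceeding it in expectation.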

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

17 code implementations • 19 Mar 2015 • James Martens, Roger Grosse

This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

Stochastic Optimization
