Search Results for author: Frederik Kunstner

Found 9 papers, 5 papers with code

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

no code implementations 29 Feb 2024 Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics.

Language Modelling
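
As a rough illustration (not taken from the paper): token classes in language-modeling data typically follow a Zipf-like law, which is one way the heavy-tailed class imbalance mentioned above arises. A minimal NumPy sketch with a hypothetical vocabulary size and exponent:

import numpy as np

# Sample token ids from a Zipf-like distribution and measure how concentrated
# the class frequencies are. Vocabulary size and exponent are illustrative only.
rng = np.random.default_rng(0)
vocab_size = 10_000
tokens = rng.zipf(a=1.2, size=1_000_000)
tokens = tokens[tokens <= vocab_size]          # clip to a finite vocabulary

counts = np.sort(np.bincount(tokens, minlength=vocab_size + 1)[1:])[::-1]
top_frac = counts[:100].sum() / counts.sum()
print(f"top 100 of {vocab_size} classes cover {top_frac:.1%} of the tokens")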

Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent -- an Open Problem

no code implementations 12 Nov 2021 Rémi Le Priol, Frederik Kunstner, Damien Scieur, Simon Lacoste-Julien

We consider the problem of upper bounding the expected log-likelihood sub-optimality of the maximum likelihood estimate (MLE), or a conjugate maximum a posteriori (MAP) for an exponential family, in a non-asymptotic way.
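
For context, the quantity being bounded is (in my own notation, not necessarily the paper's) the expected excess risk of the estimator under the population negative log-likelihood,

$$\mathbb{E}\big[ L(\hat\theta_n) - L(\theta^\star) \big], \qquad L(\theta) = \mathbb{E}_{x \sim p_{\theta^\star}}\big[ -\log p_\theta(x) \big],$$

where $\hat\theta_n$ is the MLE (or conjugate MAP) computed from $n$ i.i.d. samples and $\theta^\star$ is the true parameter; this gap is exactly the expected KL divergence $\mathbb{E}\big[\mathrm{KL}(p_{\theta^\star} \,\|\, p_{\hat\theta_n})\big]$.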

Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent

no code implementations 2 Nov 2020 Frederik Kunstner, Raunak Kumar, Mark Schmidt

In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence.
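
A sketch of the connection (standard facts, stated in my notation rather than the paper's): mirror descent with a Bregman divergence $D_\Phi$ takes steps of the form

$$\theta_{t+1} = \arg\min_{\theta} \; \langle \nabla f(\theta_t), \theta \rangle + \tfrac{1}{\eta_t} D_\Phi(\theta, \theta_t),$$

and for an exponential family with log-partition function $A$, the Bregman divergence generated by $A$ is itself a KL divergence between members of the family, $D_A(\theta, \theta') = \mathrm{KL}\big(p_{\theta'} \,\|\, p_{\theta}\big)$. Interpreting the EM update as such a mirror descent step is what makes it natural to state rates directly in KL divergence.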

Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)

no code implementations 28 Sep 2020 Sharan Vaswani, Issam H. Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions.

Binary Classification
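
For reference, a minimal NumPy sketch of the standard AMSGrad update (Reddi et al., 2018) with a constant step size and momentum; this is the textbook update, not the authors' code, and the hyperparameters are illustrative:

import numpy as np

def amsgrad(grad_fn, theta0, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)       # first moment (momentum)
    v = np.zeros_like(theta)       # second moment
    v_hat = np.zeros_like(theta)   # running max of v -- the AMSGrad correction
    for _ in range(steps):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)               # keeps effective step sizes non-increasing
        theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta

# Smooth convex example where interpolation holds trivially: f(t) = (t - 3)^2.
print(amsgrad(lambda t: 2.0 * (t - 3.0), theta0=np.zeros(1)))   # close to [3.]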

BackPACK: Packing more into backprop

1 code implementation ICLR 2020 Felix Dangel, Frederik Kunstner, Philipp Hennig

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient.
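
A hedged usage sketch based on BackPACK's documented PyTorch interface (exact extension and attribute names may vary across versions): the library piggybacks on a regular backward pass and stores additional per-parameter quantities next to .grad.

import torch
from torch.nn import CrossEntropyLoss, Linear
from backpack import backpack, extend
from backpack.extensions import BatchGrad, Variance

X, y = torch.randn(32, 10), torch.randint(0, 3, (32,))   # toy batch
model = extend(Linear(10, 3))            # extend() registers BackPACK's hooks
lossfunc = extend(CrossEntropyLoss())

loss = lossfunc(model(X), y)
with backpack(BatchGrad(), Variance()):
    loss.backward()                      # one backward pass computes everything below

for p in model.parameters():
    print(p.grad.shape)         # the usual averaged mini-batch gradient
    print(p.grad_batch.shape)   # individual per-sample gradients
    print(p.variance.shape)     # element-wise gradient variance over the batch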

Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

1 code implementation NeurIPS 2019 Frederik Kunstner, Lukas Balles, Philipp Hennig

Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information.

Second-order methods
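
To spell out the objects involved (standard definitions, written here in my own notation): the natural gradient update and the two "Fisher" matrices being compared are

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t),$$

$$F(\theta) = \sum_{n} \mathbb{E}_{y \sim p_\theta(\cdot \mid x_n)}\big[ \nabla_\theta \log p_\theta(y \mid x_n)\, \nabla_\theta \log p_\theta(y \mid x_n)^\top \big], \qquad \widetilde{F}(\theta) = \sum_{n} \nabla_\theta \log p_\theta(y_n \mid x_n)\, \nabla_\theta \log p_\theta(y_n \mid x_n)^\top,$$

where $F$ is the Fisher information (an expectation over the model's own predictive distribution) and $\widetilde{F}$ is the "empirical Fisher", which substitutes the observed labels $y_n$ for that expectation; the paper studies when the latter is a poor surrogate for the former.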

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

2 code implementations NeurIPS 2018 Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, Mohammad Emtiyaz Khan

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution.

Variational Inference
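
For context (my notation, and my recollection of the approach rather than a quote from the paper): structured Gaussian approximations of this kind typically restrict the precision matrix to a low-rank-plus-diagonal form,

$$q(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma), \qquad \Sigma^{-1} = U U^\top + \mathrm{diag}(d),$$

with $U \in \mathbb{R}^{p \times k}$ for a small rank $k \ll p$ and $d \in \mathbb{R}^p_{> 0}$, so the approximation costs $O(pk)$ to store instead of the $O(p^2)$ of a full covariance.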
