no code implementations • 29 Feb 2024 • Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks.
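As a rough illustration of the kind of heavy-tailed class imbalance meant here, the sketch below (the toy corpus and word-level tokenization are illustrative assumptions, not the paper's setup) counts how often each token appears: a few tokens dominate while most are rare.

```python
from collections import Counter

# Toy corpus; in language modelling the "classes" are the tokens to predict,
# and their frequencies are heavy-tailed: a handful of tokens cover most of
# the data while the majority of tokens are rare.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(corpus)
total = sum(counts.values())
for token, count in counts.most_common():
    print(f"{token:>4}: {count}/{total} = {count / total:.0%} of all tokens")
```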
1 code implementation • 27 Apr 2023 • Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt
This suggests that Adam outperforms SGD because it uses a more robust gradient estimate.
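To make the comparison concrete, here is a minimal sketch (not the paper's experimental setup) of how an SGD step and an Adam step respond to the same gradient; Adam's per-coordinate normalization limits the influence of a single very large gradient entry.

```python
import numpy as np

def sgd_step(x, g, lr=0.1):
    # SGD applies the raw (stochastic) gradient directly.
    return x - lr * g

def adam_step(x, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam rescales each coordinate by a running estimate of its magnitude.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

g = np.array([0.01, 0.01, 100.0])   # one extreme gradient coordinate
x = np.zeros(3)
print("SGD step :", sgd_step(x, g))
print("Adam step:", adam_step(x, g, m=np.zeros(3), v=np.zeros(3), t=1)[0])
```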
no code implementations • 12 Nov 2021 • Rémi Le Priol, Frederik Kunstner, Damien Scieur, Simon Lacoste-Julien
We consider the problem of upper bounding, in a non-asymptotic way, the expected log-likelihood sub-optimality of the maximum likelihood estimate (MLE), or of a conjugate maximum a posteriori (MAP) estimate, for an exponential family.
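In symbols, the quantity being bounded is the expected excess population negative log-likelihood of the estimator (the notation below is mine, not necessarily the paper's):

```latex
% Expected log-likelihood sub-optimality of the MLE \hat{\theta}_n computed
% from n samples, relative to the population optimum \theta^*:
\[
  \mathbb{E}\!\left[\, L(\hat{\theta}_n) - L(\theta^*) \,\right],
  \qquad
  L(\theta) = \mathbb{E}_{x \sim p_{\theta^*}}\!\left[ -\log p_\theta(x) \right].
\]
```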
no code implementations • 2 Nov 2020 • Frederik Kunstner, Raunak Kumar, Mark Schmidt
In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence.
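For reference, a minimal EM iteration for a two-component Gaussian mixture, a standard exponential-family example (the specific model and initialization are my choices for illustration, not the paper's):

```python
import numpy as np

def em_step(x, pi, mu, var):
    """One EM iteration for a two-component 1-D Gaussian mixture."""
    def pdf(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    # E-step: posterior responsibility of component 1 for each point.
    r1 = pi * pdf(x, mu[0], var[0])
    r2 = (1 - pi) * pdf(x, mu[1], var[1])
    resp = r1 / (r1 + r2)
    # M-step: re-estimate mixture weight, means, and variances.
    pi = resp.mean()
    mu = [np.average(x, weights=resp), np.average(x, weights=1 - resp)]
    var = [np.average((x - mu[0]) ** 2, weights=resp),
           np.average((x - mu[1]) ** 2, weights=1 - resp)]
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
pi, mu, var = 0.5, [-1.0, 1.0], [1.0, 1.0]
for _ in range(50):
    pi, mu, var = em_step(x, pi, mu, var)
print(pi, mu, var)
```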
no code implementations • 28 Sep 2020 • Sharan Vaswani, Issam H. Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien
Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions.
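The interpolation assumption referred to here can be written as follows (my phrasing of a standard condition; the paper's exact formulation may differ):

```latex
% Interpolation: the minimizer x^* of the average loss
% f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) simultaneously minimizes
% every individual loss:
\[
  \nabla f_i(x^*) = 0 \quad \text{for all } i = 1, \dots, n.
\]
```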
1 code implementation • 11 Jun 2020 • Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien
In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
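For concreteness, a minimal sketch of the AMSGrad update with a constant step-size and momentum (the hyperparameter values are illustrative, not the paper's):

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_max, lr=0.1, b1=0.9, b2=0.99, eps=1e-8):
    """One AMSGrad step with constant step-size lr and momentum b1."""
    m = b1 * m + (1 - b1) * g        # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2   # second-moment estimate
    v_max = np.maximum(v_max, v)     # AMSGrad keeps the running maximum of v
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```

Each call returns the updated iterate together with the optimizer state (m, v, v_max), which is passed back in at the next step.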
1 code implementation • ICLR 2020 • Felix Dangel, Frederik Kunstner, Philipp Hennig
Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient.
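A minimal PyTorch sketch of that point (the naive per-example loop below is for illustration only): the default backward pass yields the gradient averaged over the mini-batch, and anything beyond it requires extra work.

```python
import torch

# Default autodiff: one backward pass gives the gradient of the *averaged* loss.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = ((model(x) - y) ** 2).mean()
loss.backward()                     # model.weight.grad is the batch average

# Other quantities, e.g. per-example gradients, need extra work;
# here they are obtained naively with one forward/backward pass per example.
per_example_grads = [
    torch.autograd.grad(((model(x[i:i + 1]) - y[i:i + 1]) ** 2).mean(),
                        model.weight)[0]
    for i in range(len(x))
]
```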
1 code implementation • NeurIPS 2019 • Frederik Kunstner, Lukas Balles, Philipp Hennig
Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information.
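For reference, the natural gradient update and the Fisher information matrix it uses as a preconditioner (standard definitions, written in my notation):

```latex
% Natural gradient descent preconditions the gradient of the loss L with the
% Fisher information matrix F(\theta) of the model p_\theta:
\[
  \theta_{t+1} = \theta_t - \alpha\, F(\theta_t)^{-1} \nabla L(\theta_t),
  \qquad
  F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
      \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top}
  \right].
\]
```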
2 code implementations • NeurIPS 2018 • Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, Mohammad Emtiyaz Khan
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution.
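A back-of-envelope calculation of why even a Gaussian approximation is challenging at this scale (the parameter count $d = 10^7$ is illustrative, not a figure from the paper):

```latex
% A full-covariance Gaussian over d parameters stores d(d+1)/2 covariance
% entries; for d = 10^7 that is about 5 \times 10^{13} numbers, i.e. roughly
% 200 terabytes in single precision, before any inversion or sampling cost.
\[
  \frac{d(d+1)}{2} \approx 5 \times 10^{13}
  \quad \text{entries for } d = 10^7 .
\]
```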