no code implementations • 14 Oct 2024 • Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures -- grounded in a theoretical comparison of the (loss) Hessian.
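For reference, the object being compared here is the Hessian of the training loss with respect to the network parameters; a minimal sketch of the definition, in generic notation not taken from the paper:

```latex
% Loss Hessian with respect to the parameters \theta (generic notation)
L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(f_\theta(x_n), y_n\bigr),
\qquad
H(\theta) = \nabla^2_\theta L(\theta).
```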
1 code implementation • 5 Jun 2024 • Mohamed Elsayed, Homayoon Farrahi, Felix Dangel, A. Rupam Mahmood
Second-order information is valuable for many applications but challenging to compute.
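As a hedged illustration of why exact second-order information is costly while curvature-vector products stay tractable, here is a minimal PyTorch sketch of a Hessian-vector product via double backpropagation; the model and batch are placeholders, not this paper's setup:

```python
import torch

# Placeholder model and batch; the point is the double-backward pattern, not the setup.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for 2nd derivative

# Hessian-vector product Hv: differentiate <grad, v> with respect to the parameters.
v = [torch.randn_like(p) for p in params]
dot = sum((g * vi).sum() for g, vi in zip(grads, v))
hv = torch.autograd.grad(dot, params)
```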
no code implementations • 24 May 2024 • Felix Dangel, Johannes Müller, Marius Zeinhofer
Physics-informed neural networks (PINNs) are infamous for being hard to train.
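To make "physics-informed" concrete, here is a minimal, hypothetical sketch of a PINN residual loss for the 1D Poisson problem u''(x) = f(x); the network, source term, and collocation points are illustrative choices, not the paper's setup:

```python
import torch

# Tiny MLP u_theta: R -> R (illustrative architecture).
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

x = torch.rand(128, 1, requires_grad=True)                    # collocation points in (0, 1)
u = net(x)
(du,) = torch.autograd.grad(u.sum(), x, create_graph=True)    # u'(x) at each point
(d2u,) = torch.autograd.grad(du.sum(), x, create_graph=True)  # u''(x) at each point

f = -(torch.pi ** 2) * torch.sin(torch.pi * x)                # source for exact solution sin(pi x)
residual_loss = ((d2u - f) ** 2).mean()                       # PDE residual; boundary terms omitted
```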
1 code implementation • 15 Apr 2024 • Samarth Bhatia, Felix Dangel
This information is useful, though, for reducing memory whenever gradients are requested only for a parameter subset, as is the case in many modern fine-tuning tasks.
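A sketch of the setting referred to: in fine-tuning, gradients are often requested only for a small parameter subset, which in principle lets the backward pass skip storing inputs of layers whose weights are frozen. The model below is a stand-in, not the paper's benchmark:

```python
import torch

# Stand-in "backbone + head" model; in practice this could be a pretrained network.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(),   # frozen backbone
    torch.nn.Linear(512, 10),                     # trainable head
)
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

# Only the head's parameters receive gradients; layer inputs kept solely for the
# frozen weights' gradients are, in principle, unnecessary to store.
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)
```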
2 code implementations • 5 Feb 2024 • Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers.
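For reference, the adaptive update these optimizers share, written schematically in standard Adam notation (not specific to this work; AdamW additionally decouples weight decay):

```latex
% Adam update (schematic, with bias-corrected moments)
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{\odot 2},
\qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
\quad
\hat m_t = \frac{m_t}{1-\beta_1^t}, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}.
```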
2 code implementations • 9 Dec 2023 • Wu Lin, Felix Dangel, Runa Eschenhagen, Kirill Neklyudov, Agustinus Kristiadi, Richard E. Turner, Alireza Makhzani
Second-order methods such as KFAC can be useful for neural net training.
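Schematically, KFAC approximates each layer's curvature (Fisher/GGN) block as a Kronecker product of two small factors; in generic notation for a linear layer with input activations a and output gradients g:

```latex
% KFAC's Kronecker-factored block for one linear layer (schematic)
F_{\text{layer}} \;\approx\; A \otimes G,
\qquad
A = \mathbb{E}\!\left[a\, a^{\top}\right],
\qquad
G = \mathbb{E}\!\left[g\, g^{\top}\right].
```

The practical appeal is that the inverse factorizes, (A ⊗ G)^{-1} = A^{-1} ⊗ G^{-1}, so preconditioning only requires inverting the two small factors.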
no code implementations • 29 Sep 2023 • Jonathan Wenger, Felix Dangel, Agustinus Kristiadi
Kernel methods are theoretically well understood and, as a result, enjoy algorithmic benefits, which can be demonstrated to hold in wide, synthetic neural network architectures.
no code implementations • 5 Jul 2023 • Felix Dangel
Despite the simple intuition behind them, convolutions are more tedious to analyze than dense layers, which complicates the transfer of theoretical and algorithmic ideas to convolutions.
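One standard way such a transfer is made concrete is to rewrite a convolution as a dense matrix multiplication via im2col/unfold; a minimal PyTorch check, with arbitrary example shapes:

```python
import torch
import torch.nn.functional as F

# A convolution can be rewritten as a dense matrix multiplication via unfold (im2col),
# which is one route for carrying dense-layer analyses over to convolutions.
x = torch.randn(2, 3, 8, 8)                       # (N, C_in, H, W)
weight = torch.randn(4, 3, 3, 3)                  # (C_out, C_in, kH, kW)

reference = F.conv2d(x, weight)                   # (2, 4, 6, 6)

patches = F.unfold(x, kernel_size=3)              # (N, C_in*kH*kW, L) with L = 6*6
W_mat = weight.flatten(start_dim=1)               # (C_out, C_in*kH*kW)
out = (W_mat @ patches).view(2, 4, 6, 6)          # dense matmul, then reshape

print(torch.allclose(reference, out, atol=1e-5))  # True up to numerical precision
```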
5 code implementations • 4 Jun 2021 • Felix Dangel, Lukas Tatzel, Philipp Hennig
Curvature, in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation, is valuable for algorithms that rely on a local model of the loss to train, compress, or explain deep networks.
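For reference, the generalized Gauss-Newton mentioned here is, schematically and in generic notation,

```latex
% Generalized Gauss-Newton (schematic): network Jacobians composed with the loss Hessian
G(\theta) = \sum_{n} \bigl[J_\theta f_\theta(x_n)\bigr]^{\top}
            \, \nabla^2_{f}\,\ell\bigl(f, y_n\bigr)\big|_{f = f_\theta(x_n)} \,
            \bigl[J_\theta f_\theta(x_n)\bigr],
```

which coincides with the Hessian when the network is linear in the parameters and is positive semi-definite for convex losses.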
2 code implementations • NeurIPS 2021 • Frank Schneider, Felix Dangel, Philipp Hennig
When engineers train deep learning models, they are very much 'flying blind'.
1 code implementation • ICLR 2020 • Felix Dangel, Frederik Kunstner, Philipp Hennig
Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient.
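As a sketch of the kind of quantity that goes beyond the average gradient, here is one way to obtain individual (per-sample) gradients in plain PyTorch using torch.func; the model and loss are placeholders, not this paper's implementation:

```python
import torch
from torch.func import functional_call, grad, vmap

# Placeholder model and batch.
model = torch.nn.Linear(10, 2)
data, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))
params = {name: p.detach() for name, p in model.named_parameters()}

def sample_loss(params, x, y):
    logits = functional_call(model, params, (x.unsqueeze(0),))
    return torch.nn.functional.cross_entropy(logits, y.unsqueeze(0))

# vmap over the batch dimension yields one gradient per sample instead of their average.
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0, 0))(params, data, targets)
print(per_sample_grads["weight"].shape)  # torch.Size([32, 2, 10])
```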
1 code implementation • 5 Feb 2019 • Felix Dangel, Stefan Harmeling, Philipp Hennig
We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian).
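Schematically, such a modular scheme rests on the second-order chain rule, which propagates curvature backward through a module z^(l) = T(z^(l-1)) just as backpropagation propagates gradients (generic notation, not the paper's exact formulation):

```latex
% Second-order chain rule behind modular curvature backpropagation (schematic)
\nabla^2_{z^{(l-1)}} E
= J^{\top} \bigl[\nabla^2_{z^{(l)}} E\bigr] J
  \;+\; \sum_{k} \bigl[\nabla_{z^{(l)}} E\bigr]_k \,\nabla^2_{z^{(l-1)}} z^{(l)}_k,
\qquad
J = \frac{\partial z^{(l)}}{\partial z^{(l-1)}}.
```

Dropping or positive-semi-definitely modifying the second (residual) term leads to approximations in the spirit of the GGN and the positive-curvature Hessian.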