Search Results for author: Felix Dangel

Found 12 papers, 8 papers with code

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

no code implementations • 14 Oct 2024 • Weronika Ormaniec, Felix Dangel, Sidak Pal Singh

In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures, grounded in a theoretical comparison of the (loss) Hessian.

Lowering PyTorch's Memory Consumption for Selective Differentiation

1 code implementation • 15 Apr 2024 • Samarth Bhatia, Felix Dangel

This information can, however, be used to reduce memory whenever gradients are requested only for a subset of the parameters, as is the case in many modern fine-tuning tasks.
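
The scenario described above can be set up in a few lines of standard PyTorch. The toy sketch below (not the paper's code, and without its memory savings) freezes everything except the final layer, so gradients are requested for a parameter subset only, as in typical fine-tuning.

```python
import torch
from torch import nn

# Toy model standing in for a pre-trained backbone plus a small head to fine-tune.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),  # only this layer will be trained
)

# Selective differentiation: request gradients for a subset of the parameters.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

x = torch.randn(4, 3, 16, 16)
loss = model(x).square().mean()
loss.backward()  # gradients exist only for the unfrozen head

print([name for name, p in model.named_parameters() if p.grad is not None])
```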

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

2 code implementations • 5 Feb 2024 • Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers.

Second-order methods
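
To make the square-root question concrete, the sketch below contrasts an Adam(W)-style update, which divides by the root of the second-moment estimate, with a root-free variant that divides by the estimate itself. It is a toy illustration of the two update rules only (bias correction and weight decay omitted), not the optimizer developed in the paper.

```python
import torch

def adaptive_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, use_sqrt=True):
    """One adaptive-gradient step (bias correction and weight decay omitted)."""
    m = b1 * m + (1 - b1) * g        # first-moment estimate
    v = b2 * v + (1 - b2) * g * g    # second-moment estimate
    denom = v.sqrt() + eps if use_sqrt else v + eps
    return p - lr * m / denom, m, v

p = torch.zeros(3)
g = torch.tensor([0.1, -0.5, 2.0])   # a made-up gradient
m, v = torch.zeros(3), torch.zeros(3)

p_sqrt, _, _ = adaptive_step(p, g, m, v, use_sqrt=True)     # Adam(W)-style
p_nosqrt, _, _ = adaptive_step(p, g, m, v, use_sqrt=False)  # square-root-free
print(p_sqrt, p_nosqrt)  # the two variants scale the update very differently
```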

On the Disconnect Between Theory and Practice of Neural Networks: Limits of the NTK Perspective

no code implementations • 29 Sep 2023 • Jonathan Wenger, Felix Dangel, Agustinus Kristiadi

Kernel methods are theoretically well understood and, as a result, enjoy algorithmic benefits that can be demonstrated to hold for wide, synthetic neural network architectures.

Continual Learning • Uncertainty Quantification
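
One way to make the kernel view above concrete is the empirical neural tangent kernel, whose Gram matrix is built from parameter gradients of the network outputs. The sketch below forms it explicitly for a tiny scalar-output MLP; it is only a toy illustration, not the paper's experimental setup.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
params = list(net.parameters())

def flat_grad(x):
    """Gradient of the scalar output f(x) w.r.t. all parameters, flattened."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

X = torch.randn(5, 2)
J = torch.stack([flat_grad(x) for x in X])  # rows: per-input parameter gradients
ntk = J @ J.T                               # empirical NTK Gram matrix at initialization
print(ntk.shape, torch.linalg.eigvalsh(ntk))
```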

Convolutions and More as Einsum: A Tensor Network Perspective with Advances for Second-Order Methods

no code implementations • 5 Jul 2023 • Felix Dangel

Despite the simple intuition behind them, convolutions are more tedious to analyze than dense layers, which complicates transferring theoretical and algorithmic ideas to them.

Second-order methods • Tensor Networks
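
The einsum view can be illustrated with the classic im2col trick: unfold the input into patches and contract them with the flattened kernel. The sketch below checks this against torch.nn.functional.conv2d; it shows only the basic reformulation, not the tensor-network constructions developed in the paper.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8)   # (batch, in_channels, height, width)
w = torch.randn(4, 3, 3, 3)   # (out_channels, in_channels, kernel_h, kernel_w)

reference = F.conv2d(x, w)    # shape (2, 4, 6, 6)

# im2col: extract 3x3 patches, then contract with the flattened kernel via einsum.
patches = F.unfold(x, kernel_size=3)                     # (2, 3*3*3, 36)
result = torch.einsum("oi,bip->bop", w.reshape(4, -1), patches)
result = result.reshape(2, 4, 6, 6)

print(torch.allclose(reference, result, atol=1e-5))      # True
```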

ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure

5 code implementations • 4 Jun 2021 • Felix Dangel, Lukas Tatzel, Philipp Hennig

Curvature in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model for the loss to train, compress, or explain deep networks.
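
The low-rank structure in the title is easiest to see for square loss, where the GGN equals JᵀJ with J stacking the per-sample output-parameter Jacobians, so its rank is at most the number of samples times the number of outputs. The brute-force sketch below builds this matrix for a tiny network to make the rank visible; it does not reproduce ViViT, which is about accessing curvature through this structure rather than forming the matrix explicitly.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 2))
params = list(net.parameters())
num_params = sum(p.numel() for p in params)

X = torch.randn(4, 3)  # N = 4 samples, C = 2 outputs per sample

# Stack the per-sample output-parameter Jacobians into J of shape (N*C, P).
rows = []
for n in range(X.shape[0]):
    out = net(X[n : n + 1]).squeeze(0)
    for c in range(out.shape[0]):
        grads = torch.autograd.grad(out[c], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)

# For square loss the GGN is J^T J, so its rank is at most N*C despite being P x P.
ggn = J.T @ J
print(num_params, torch.linalg.matrix_rank(ggn).item())  # 50 parameters, rank <= 8
```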

BackPACK: Packing more into backprop

1 code implementation • ICLR 2020 • Felix Dangel, Frederik Kunstner, Philipp Hennig

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient.
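
That one-liner is the motivation: a standard backward pass only yields the averaged mini-batch gradient, and anything richer, such as individual gradients, requires workarounds. The sketch below shows the naive workaround of one backward pass per example; it is not BackPACK's API, which is designed to extract such quantities during a single backward pass.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(16, 5), torch.randn(16, 1)

# What automatic differentiation gives by default: one averaged mini-batch gradient.
loss_fn(model(X), y).backward()
avg_grad = model.weight.grad.clone()
model.zero_grad()

# Naive per-sample gradients: one backward pass per example.
per_sample = torch.stack([
    torch.autograd.grad(loss_fn(model(X[n : n + 1]), y[n : n + 1]), model.weight)[0]
    for n in range(X.shape[0])
])

print(torch.allclose(per_sample.mean(0), avg_grad, atol=1e-6))  # True
```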

Modular Block-diagonal Curvature Approximations for Feedforward Architectures

1 code implementation • 5 Feb 2019 • Felix Dangel, Stefan Harmeling, Philipp Hennig

We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian).

BIG-bench Machine Learning
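
A block-diagonal curvature approximation keeps, for each parameter tensor, only the curvature block with respect to itself and discards all cross-layer blocks. The brute-force sketch below forms the exact Hessian blocks of a toy objective with nested autograd calls; the paper's modular backpropagation extension computes block-diagonal approximations of such matrices (Hessian, GGN, and positive-curvature Hessian) during backprop, which this sketch does not attempt to reproduce.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 1))
X, y = torch.randn(8, 2), torch.randn(8, 1)
loss = nn.MSELoss()(net(X), y)

def hessian_block(loss, param):
    """Exact Hessian of the loss w.r.t. one parameter tensor (a single diagonal block)."""
    grad = torch.autograd.grad(loss, param, create_graph=True)[0].reshape(-1)
    rows = [
        torch.autograd.grad(g, param, retain_graph=True)[0].reshape(-1) for g in grad
    ]
    return torch.stack(rows)

# One curvature block per parameter tensor; cross-layer blocks are dropped.
blocks = {name: hessian_block(loss, p) for name, p in net.named_parameters()}
for name, block in blocks.items():
    print(name, tuple(block.shape))
```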
