no code implementations • 4 Feb 2025 • Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics -- instead of improving the geometry of the function itself.
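For context, the method under study here is sharpness-aware minimization (SAM). Below is a minimal sketch of the standard two-step SAM update (ascend along the gradient direction, then update with the gradient taken at the perturbed point); the model, loss, and hyperparameters are illustrative placeholders, and this shows only the baseline procedure the paper analyzes, not its findings.

```python
# Minimal sketch of the SAM update rule; model/loss/rho are illustrative.
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    # 1) gradient at the current parameters
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) move to the (approximate) worst-case point within an L2 ball of radius rho
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()

    # 3) gradient at the perturbed parameters, then undo the perturbation and update
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```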
no code implementations • 4 Nov 2024 • Jim Zhao, Sidak Pal Singh, Aurelien Lucchi
Finally, we empirically validate the bounds and uncover valuable insights into the influence of the analyzed architectural components.
no code implementations • 14 Oct 2024 • Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures -- grounded in a theoretical comparison of the (loss) Hessian.
no code implementations • 23 Jul 2024 • Giulia Lanzillotta, Sidak Pal Singh, Benjamin F. Grewe, Thomas Hofmann
We classify existing continual learning algorithms based on the approximation used, and we assess the practical effects of this distinction in common continual learning settings. Additionally, we study optimal continual learning objectives in the case of local polynomial approximations, and we provide examples of existing algorithms that implement these optimal objectives.
no code implementations • 24 Jun 2024 • Sidak Pal Singh, Linara Adilova, Michael Kamp, Asja Fischer, Bernhard Schölkopf, Thomas Hofmann
In this work, we take a step towards understanding it by providing a model of how the loss landscape needs to behave topographically for linear mode connectivity (LMC), or the lack thereof, to manifest.
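A common way to probe LMC in practice, and a plausible minimal stand-in for the kind of landscape measurement discussed above, is to evaluate the loss along the straight line between two trained parameter vectors. The sketch below assumes two compatible state_dicts and a data loader, all illustrative.

```python
# Sketch: loss along the linear path between two trained solutions.
import copy
import torch

def loss_along_line(model, state_a, state_b, loss_fn, loader, steps=11):
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        interp = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
        m = copy.deepcopy(model)
        m.load_state_dict(interp)
        m.eval()
        with torch.no_grad():
            total, n = 0.0, 0
            for x, y in loader:
                total += loss_fn(m(x), y).item() * len(y)
                n += len(y)
        losses.append(total / n)
    # A pronounced loss barrier above the endpoints indicates the absence of LMC.
    return losses
```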
no code implementations • 12 Mar 2024 • Sidak Pal Singh, Bobby He, Thomas Hofmann, Bernhard Schölkopf
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters.
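As a rough illustration of what directional structure of a trajectory can mean in practice (not necessarily the paper's exact analysis), one can compute the cosine similarity between the flattened parameter vectors at different checkpoints of a run:

```python
# Sketch: pairwise cosine similarities between trajectory checkpoints.
import torch

def trajectory_cosine_matrix(checkpoints):
    # checkpoints: list of state_dicts saved during training (illustrative input)
    flat = [torch.cat([v.flatten().float() for v in sd.values()]) for sd in checkpoints]
    X = torch.stack(flat)
    X = X / X.norm(dim=1, keepdim=True)
    return X @ X.T  # entry (i, j) is the cosine between checkpoints i and j
```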
1 code implementation • 12 Feb 2024 • Alexander Theus, Olin Geimer, Friedrich Wicke, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
Structural pruning of neural networks conventionally relies on identifying and discarding less important neurons, a practice often resulting in significant accuracy loss that necessitates subsequent fine-tuning efforts.
no code implementations • 17 Nov 2023 • Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes, Sidak Pal Singh
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks.
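The general recipe such an analysis relies on is knowledge distillation: a shallow feed-forward network is trained to regress the outputs of a frozen attention block. The sketch below shows that recipe in miniature, with placeholder sizes and random inputs rather than the paper's actual setup.

```python
# Sketch: distilling a frozen self-attention block into a shallow MLP.
import torch
import torch.nn as nn

d_model, seq_len = 64, 16
attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()

# The student sees the flattened sequence and predicts the flattened attention output.
student = nn.Sequential(
    nn.Linear(seq_len * d_model, 1024),
    nn.ReLU(),
    nn.Linear(1024, seq_len * d_model),
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(32, seq_len, d_model)      # stand-in for real token representations
    with torch.no_grad():
        target, _ = attention(x, x, x)          # teacher: self-attention output
    pred = student(x.flatten(1)).view_as(target)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```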
1 code implementation • 9 Oct 2023 • Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
Fusion is a technique for merging multiple independently trained neural networks in order to combine their capabilities.
no code implementations • 2 Oct 2023 • Giulia Lanzillotta, Sidak Pal Singh, Benjamin F. Grewe, Thomas Hofmann
Deep learning has proved to be a successful paradigm for solving many challenges in machine learning.
no code implementations • 10 Jul 2023 • Alison Pouplin, Hrittik Roy, Sidak Pal Singh, Georgios Arvanitidis
In this work, we consider the loss landscape as an embedded Riemannian manifold and show that the differential geometric properties of the manifold can be used when analyzing the generalization abilities of a deep net.
no code implementations • 16 May 2023 • Sidak Pal Singh, Thomas Hofmann, Bernhard Schölkopf
While Convolutional Neural Networks (CNNs) have long been investigated, applied, and theorized, we aim to offer a slightly different perspective on their nature -- through the lens of their Hessian maps.
1 code implementation • 21 Feb 2023 • Grigory Khromov, Sidak Pal Singh
Lipschitz continuity is a crucial functional property of any predictive model that naturally governs its robustness, generalisation, and adversarial vulnerability.
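A minimal sketch of the standard upper bound this line of work starts from: for a feed-forward ReLU network, the product of the layers' spectral norms upper-bounds the L2 Lipschitz constant. The model and sizes below are illustrative.

```python
# Sketch: crude Lipschitz upper bound as the product of layer spectral norms.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 10))

upper_bound = 1.0
for layer in model:
    if isinstance(layer, nn.Linear):
        upper_bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
print(f"Lipschitz upper bound (product of spectral norms): {upper_bound:.2f}")
```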
1 code implementation • 24 Aug 2022 • Elias Frantar, Sidak Pal Singh, Dan Alistarh
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data.
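As a toy illustration of the one-shot setting, and explicitly not the paper's method, the sketch below prunes a trained model by weight magnitude without any retraining and uses only a small calibration batch to check the resulting loss.

```python
# Toy stand-in for one-shot/post-training compression with calibration data.
import torch
import torch.nn as nn

def magnitude_prune_(model, sparsity=0.5):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            w = m.weight.data
            k = max(1, int(sparsity * w.numel()))
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())   # zero out the smallest weights

def calibration_loss(model, loss_fn, calib_x, calib_y):
    model.eval()
    with torch.no_grad():
        return loss_fn(model(calib_x), calib_y).item()
```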
no code implementations • 7 Jun 2022 • Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurelien Lucchi
First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization.
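One simple way to quantify rank collapse numerically (a hedged stand-in for the diagnostics used in this line of work) is to check how dominant the leading singular value of a layer's token-representation matrix is:

```python
# Sketch: a crude rank-collapse score; the representations are random placeholders.
import torch

def rank_collapse_ratio(H):
    # H: (seq_len, d_model) token representations from some layer
    s = torch.linalg.svdvals(H)
    return (s[0] / s.sum()).item()   # close to 1.0 means near rank-one (collapsed)

H = torch.randn(128, 64)
print(rank_collapse_ratio(H))
```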
no code implementations • ICLR 2022 • Sidak Pal Singh, Aurelien Lucchi, Thomas Hofmann, Bernhard Schölkopf
`Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized.
no code implementations • NeurIPS 2021 • Sidak Pal Singh, Gregor Bachmann, Thomas Hofmann
Moreover, we demonstrate that our bounds remain faithful as an estimate of the numerical Hessian rank, for a larger class of models such as rectified and hyperbolic tangent networks.
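A hedged sketch of how such a bound can be checked numerically on a tiny tanh network: assemble the full parameter Hessian with autograd and compute its numerical rank. All sizes, data, and tolerances are illustrative.

```python
# Sketch: numerical rank of the loss Hessian of a tiny two-layer tanh network.
import torch

d_in, d_hidden, d_out, n = 3, 4, 2, 10
X, Y = torch.randn(n, d_in), torch.randn(n, d_out)
W1, W2 = torch.randn(d_hidden, d_in), torch.randn(d_out, d_hidden)

def loss(w1_flat, w2_flat):
    W1_, W2_ = w1_flat.view(d_hidden, d_in), w2_flat.view(d_out, d_hidden)
    return ((X @ W1_.T).tanh() @ W2_.T - Y).pow(2).mean()

# Block Hessian w.r.t. both flattened weight matrices, assembled into one matrix.
H_blocks = torch.autograd.functional.hessian(loss, (W1.flatten(), W2.flatten()))
H = torch.cat([torch.cat([H_blocks[0][0], H_blocks[0][1]], dim=1),
               torch.cat([H_blocks[1][0], H_blocks[1][1]], dim=1)], dim=0)
print("numerical Hessian rank:", torch.linalg.matrix_rank(H, atol=1e-6).item())
```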
1 code implementation • NeurIPS 2020 • Sidak Pal Singh, Dan Alistarh
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems.
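The building block such methods rely on is the Hessian-vector product computed by double backpropagation, without ever forming the Hessian. A minimal sketch, with a placeholder model and data:

```python
# Sketch: Hessian-vector product H @ v via double backprop.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
params = list(model.parameters())

loss = nn.functional.mse_loss(model(x), y)
grads = torch.autograd.grad(loss, params, create_graph=True)

v = [torch.randn_like(p) for p in params]           # the vector to multiply by
dot = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(dot, params)               # H @ v, never forming H
```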
2 code implementations • NeurIPS 2020 • Sidak Pal Singh, Martin Jaggi
Finally, our approach also provides a principled way to combine the parameters of neural networks with different widths, and we explore its application for model compression.
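A much-simplified stand-in for the idea of aligning neurons before averaging: the paper uses an optimal-transport-based soft alignment and also handles different widths, whereas the sketch below uses a hard permutation via the Hungarian algorithm and equal widths, purely to illustrate why naive averaging without alignment is the wrong baseline.

```python
# Sketch: permutation-align one layer's neurons before averaging two networks.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_average(W_a, W_b):
    # W_a, W_b: (hidden, in) weight matrices of the first layer of two trained nets
    cost = -W_a @ W_b.T                      # similarity of neuron i (net A) and j (net B)
    rows, cols = linear_sum_assignment(cost) # best one-to-one matching of neurons
    W_b_aligned = W_b[cols]                  # reorder net B's neurons to match net A
    return 0.5 * (W_a + W_b_aligned)

W_a, W_b = np.random.randn(8, 5), np.random.randn(8, 5)
print(align_and_average(W_a, W_b).shape)
```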
1 code implementation • 15 Jul 2019 • Sidak Pal Singh, Angela Fan, Michael Auli
Both are trained to reconstruct the sentence based on a latent code, and our model can be used to generate text.
2 code implementations • 29 Aug 2018 • Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi
We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding.
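A hedged sketch of the core idea: each entity is a probability distribution over a shared set of contexts, and two entities are compared with an optimal transport (Sinkhorn) cost over a ground metric between contexts. Everything below is a random placeholder rather than the paper's construction.

```python
# Sketch: comparing two entities represented as distributions over contexts.
import torch

def sinkhorn_cost(a, b, C, reg=1.0, iters=200):
    # a, b: probability vectors; C: pairwise ground-cost matrix between contexts
    K = torch.exp(-C / reg)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]           # entropic-regularised transport plan
    return (P * C).sum()

n_contexts = 20
context_emb = torch.randn(n_contexts, 8)
C = torch.cdist(context_emb, context_emb)      # ground cost: distances between contexts
a = torch.softmax(torch.randn(n_contexts), 0)  # entity 1 as a distribution over contexts
b = torch.softmax(torch.randn(n_contexts), 0)  # entity 2
print(sinkhorn_cost(a, b, C).item())
```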
no code implementations • 5 Jun 2018 • Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi
We propose a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding).