no code implementations • 16 Oct 2024 • Daniel J. Lee, Stefan Heimersheim
Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions.
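A minimal sketch of such a perturbation experiment, assuming a TransformerLens-style hooked model; the prompt, layer, direction, and perturbation scale are illustrative placeholders, and the KL divergence between the original and perturbed next-token distributions serves as the sensitivity measure.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # any hooked model would do
tokens = model.to_tokens("The quick brown fox jumps over the")

layer = 6                                   # illustrative choice of layer
direction = torch.randn(model.cfg.d_model)  # hypothetical direction to probe
direction = direction / direction.norm()
scale = 5.0                                 # perturbation magnitude (swept in a real experiment)

def perturb_hook(resid, hook):
    # Push every position's residual-stream activation along the chosen direction.
    return resid + scale * direction

clean_logits = model(tokens)
perturbed_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer), perturb_hook)],
)

# Sensitivity measure: KL(clean || perturbed) over the next-token distribution.
p = clean_logits[0, -1].softmax(-1)
q = perturbed_logits[0, -1].softmax(-1)
kl = (p * (p.log() - q.log())).sum()
print(f"KL divergence under perturbation: {kl.item():.4f}")
```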
no code implementations • 11 Oct 2024 • Daniel Balcells, Benjamin Lerner, Michael Oesterle, Ediz Ucar, Stefan Heimersheim
Sparse Autoencoders for transformer-based language models are typically defined independently per layer.
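For reference, a minimal sketch of that standard per-layer setup: one sparse autoencoder trained on a single layer's activations, with dimensions and hyperparameters chosen purely for illustration.

```python
import torch
import torch.nn as nn

class PerLayerSAE(nn.Module):
    """A standard sparse autoencoder attached to a single layer's activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))   # sparse latent code
        return self.decoder(latents), latents

# One independent SAE per layer -- the conventional setup referred to above.
d_model, d_hidden, n_layers = 768, 8 * 768, 12
saes = [PerLayerSAE(d_model, d_hidden) for _ in range(n_layers)]

acts = torch.randn(32, d_model)                    # a batch of one layer's activations
recon, latents = saes[5](acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * latents.abs().sum(-1).mean()
```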
no code implementations • 25 Sep 2024 • Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim
We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries.
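A rough sketch of how such sensitivity could be probed, assuming TransformerLens: interpolate between two cached residual-stream activations and watch how the model's top prediction changes along the path. The prompts, layer, and position are illustrative assumptions, not the paper's setup.

```python
import functools
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

# Cache the residual-stream activation at the final position for two prompts.
_, cache_a = model.run_with_cache(model.to_tokens("The capital of France is"))
_, cache_b = model.run_with_cache(model.to_tokens("The capital of Spain is"))
act_a, act_b = cache_a[hook_name][0, -1], cache_b[hook_name][0, -1]

tokens = model.to_tokens("The capital of France is")

def interpolation_hook(resid, hook, alpha):
    # Replace the final position's activation with a point on the line from act_a to act_b.
    resid[0, -1] = (1 - alpha) * act_a + alpha * act_b
    return resid

# Inside a stable region the top prediction barely moves; at a boundary it flips.
for alpha in torch.linspace(0, 1, 11).tolist():
    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(hook_name, functools.partial(interpolation_hook, alpha=alpha))],
    )
    top = model.tokenizer.decode(logits[0, -1].argmax().item())
    print(f"alpha={alpha:.1f}  top next token: {top!r}")
```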
no code implementations • 23 Sep 2024 • Giorgi Giglemiani, Nora Petrova, Chatrik Singh Mangat, Jett Janiak, Stefan Heimersheim
In our study, we assess model sensitivity in order to compare real activations to synthetic activations composed of SAE latents.
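One hedged way to construct such a synthetic activation, assuming an SAE decoder matrix is available; the latent count and magnitudes are arbitrary placeholders, not the paper's procedure.

```python
import torch

# Hypothetical trained SAE decoder; dimensions chosen for illustration.
d_model, d_sae = 768, 8 * 768
W_dec = torch.randn(d_sae, d_model)   # one decoder direction per latent
b_dec = torch.zeros(d_model)

def synthetic_activation(n_active: int = 32) -> torch.Tensor:
    """Compose an activation vector from a small random set of SAE latents."""
    idx = torch.randperm(d_sae)[:n_active]      # which latents are active
    magnitudes = torch.rand(n_active) * 5.0     # illustrative activation strengths
    return b_dec + magnitudes @ W_dec[idx]

synthetic = synthetic_activation()
# In the experiment, a vector like this would be patched into the model via a forward
# hook, and its sensitivity to perturbations compared against a real activation's.
```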
1 code implementation • 6 Sep 2024 • Stefan Heimersheim
The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.
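As a hedged illustration of why LN is the obstacle, the sketch below freezes the data-dependent division by each input's own standard deviation to a fixed constant, turning the layer into a purely affine map; this is an assumption-laden stand-in, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn as nn

class FrozenScaleLN(nn.Module):
    """LayerNorm with the data-dependent scaling frozen to a constant.

    Standard LN divides each input by its own standard deviation, which is the
    nonlinear step that obstructs linear analysis; with a fixed divisor the layer
    becomes purely affine.
    """
    def __init__(self, ln: nn.LayerNorm, fixed_std: float):
        super().__init__()
        self.weight, self.bias = ln.weight, ln.bias
        self.fixed_std = fixed_std

    def forward(self, x):
        x = x - x.mean(-1, keepdim=True)        # centering is already linear
        return x / self.fixed_std * self.weight + self.bias

ln = nn.LayerNorm(768)
x = torch.randn(4, 768)
linear_ln = FrozenScaleLN(ln, fixed_std=x.std(-1).mean().item())
print((ln(x) - linear_ln(x)).abs().max().item())   # deviation from true LN on this batch
```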
1 code implementation • 17 May 2024 • Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn
We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis: the Local Interaction Basis (LIB).
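The LIB construction itself is not spelled out here; purely as a hypothetical stand-in, the sketch below shows a generic (PCA-style) linear change of basis on activations, i.e. what "transforming the activations into a new basis" means mechanically.

```python
import torch

# Activations at one layer (random stand-ins for illustration).
acts = torch.randn(1000, 768)
centered = acts - acts.mean(0)

# A generic orthogonal basis from PCA; the LIB uses a different, interaction-based construction.
_, _, Vh = torch.linalg.svd(centered, full_matrices=False)

acts_in_new_basis = centered @ Vh.T                    # coordinates in the new basis
reconstructed = acts_in_new_basis @ Vh + acts.mean(0)  # invertible: no information is lost
```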
no code implementations • 17 May 2024 • Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
We propose that representing a neural network in a way that is invariant to reparameterizations exploiting these degeneracies is likely to make it more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions.
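A small worked example of one such degeneracy, the positive rescaling symmetry of ReLU layers: scaling one layer's weights and biases up and the next layer's weights down by the same factor leaves the computed function unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
x = torch.randn(5, 10)
y_before = mlp(x)

# Rescale: multiply the first layer by c and divide the second by c.
# Since ReLU(c * z) = c * ReLU(z) for c > 0, the overall function is unchanged,
# even though the parameters are different -- a degeneracy in parameter space.
c = 3.0
with torch.no_grad():
    mlp[0].weight *= c
    mlp[0].bias *= c
    mlp[2].weight /= c

y_after = mlp(x)
print(torch.allclose(y_before, y_after, atol=1e-5))   # True: same function, new parameters
```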
no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda
Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.
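A minimal sketch of the basic technique, assuming TransformerLens; the prompts, layer, and metric are illustrative choices in the style of the indirect-object-identification task, not prescriptions from the paper.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Clean and corrupted prompts that differ in a single token (illustrative choice).
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook):
    # Overwrite the corrupted run's residual stream with the clean run's cached value.
    return clean_cache[hook.name]

layer = 6
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
)

# Metric: logit difference between the two candidate answers, before and after patching.
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
corrupt_logits = model(corrupt_tokens)
print("corrupted:", (corrupt_logits[0, -1, mary] - corrupt_logits[0, -1, john]).item())
print("patched:  ", (patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())
```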
4 code implementations • NeurIPS 2023 • Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
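A highly simplified sketch of an ACDC-style pruning loop, not the released implementation; the edge list and metric callback are hypothetical.

```python
# `edges` is assumed to be a list of computational-graph edges in reverse topological
# order, and `metric_with_ablated(removed)` a hypothetical callback that runs the model
# with those edges corrupted and returns e.g. the KL divergence from the full model's
# outputs (higher = worse). Neither exists in this form in the released code.

def acdc_prune(edges, metric_with_ablated, threshold):
    removed = set()
    for edge in edges:
        baseline = metric_with_ablated(removed)
        candidate = metric_with_ablated(removed | {edge})
        # Remove the edge if ablating it barely changes the metric;
        # otherwise keep it as part of the discovered circuit.
        if candidate - baseline < threshold:
            removed.add(edge)
    return [e for e in edges if e not in removed]
```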