Search Results for author: Stefan Heimersheim

Found 9 papers, 3 papers with code

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

no code implementations · 16 Oct 2024 · Daniel J. Lee, Stefan Heimersheim

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions.
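
A minimal sketch of this kind of perturbation experiment, assuming the TransformerLens library, a random unit direction, and an arbitrary layer and perturbation scale (the paper compares carefully chosen directions and baselines rather than this toy setup):

```python
import torch
from transformer_lens import HookedTransformer  # assumed tooling, not the paper's code

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
tokens = model.to_tokens("The quick brown fox jumps over the lazy")

# Baseline next-token distribution at the final position
base_logprobs = model(tokens)[0, -1].log_softmax(-1)

# Perturb the residual stream at layer 6 along a random unit direction
direction = torch.randn(model.cfg.d_model)
direction = direction / direction.norm()

def perturb(resid, hook, scale=10.0):
    # resid has shape [batch, pos, d_model]; shift every position along the direction
    return resid + scale * direction

pert_logprobs = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.6.hook_resid_post", perturb)]
)[0, -1].log_softmax(-1)

# KL divergence between the baseline and perturbed next-token distributions
kl = (base_logprobs.exp() * (base_logprobs - pert_logprobs)).sum()
print(f"KL(base || perturbed) = {kl.item():.4f}")
```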

Evolution of SAE Features Across Layers in LLMs

no code implementations · 11 Oct 2024 · Daniel Balcells, Benjamin Lerner, Michael Oesterle, Ediz Ucar, Stefan Heimersheim

Sparse Autoencoders for transformer-based language models are typically defined independently per layer.
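
For context, a bare-bones per-layer SAE might look like the following sketch (hypothetical dimensions and a simple ReLU/L1 setup; not the specific architecture or training recipe used in the paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into an overcomplete, sparse latent space."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(latents)           # reconstruction of the input
        return recon, latents

# "Defined independently per layer": one SAE per residual-stream layer,
# each trained only on that layer's activations.
d_model, d_latent, n_layers = 768, 24576, 12
saes = [SparseAutoencoder(d_model, d_latent) for _ in range(n_layers)]

x = torch.randn(4, d_model)                     # a batch of layer-6 activations
recon, latents = saes[6](x)
loss = (recon - x).pow(2).mean() + 1e-3 * latents.abs().mean()
```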

Characterizing stable regions in the residual stream of LLMs

no code implementations · 25 Sep 2024 · Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim

We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries.
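
One way to probe for such regions is to interpolate between the residual-stream activations of two prompts and watch where the prediction changes. The sketch below assumes TransformerLens, an arbitrary layer, and a crude "top next token" readout; it is a simplified stand-in for the paper's experiments:

```python
import torch
from transformer_lens import HookedTransformer  # assumed tooling

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2", device="cpu")
hook_name = "blocks.6.hook_resid_post"

# Residual-stream activations at the final position of two different prompts
toks_a = model.to_tokens("The capital of France is")
toks_b = model.to_tokens("My favourite colour is")
act_a = model.run_with_cache(toks_a)[1][hook_name][0, -1]
act_b = model.run_with_cache(toks_b)[1][hook_name][0, -1]

# Walk along the line between the two activations: inside a stable region the
# top prediction barely moves, at a boundary it flips abruptly.
for alpha in torch.linspace(0, 1, 11):
    mixed = (1 - alpha) * act_a + alpha * act_b

    def overwrite(resid, hook):
        resid[0, -1] = mixed  # patch the interpolated activation into the run
        return resid

    logits = model.run_with_hooks(toks_a, fwd_hooks=[(hook_name, overwrite)])
    top_token = logits[0, -1].argmax().item()
    print(f"alpha={alpha:.1f}  top prediction: {model.tokenizer.decode([top_token])!r}")
```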

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

no code implementations · 23 Sep 2024 · Giorgi Giglemiani, Nora Petrova, Chatrik Singh Mangat, Jett Janiak, Stefan Heimersheim

In our study, we assess model sensitivity in order to compare real activations to synthetic activations composed of SAE latents.
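
As a rough illustration of what "synthetic activations composed of SAE latents" means, the sketch below builds one from random decoder directions; the random weights stand in for a trained GPT-2 SAE, and the choice of how many latents to activate and with what coefficients is exactly the kind of design choice the paper varies:

```python
import torch

# Stand-in SAE decoder: d_latent unit-norm feature directions in d_model space
d_model, d_latent = 768, 24576
W_dec = torch.randn(d_latent, d_model)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)
b_dec = torch.zeros(d_model)

# A synthetic activation is a sparse positive combination of decoder directions,
# mimicking how an SAE reconstructs a real residual-stream activation.
k = 32                                     # number of active latents
idx = torch.randperm(d_latent)[:k]
coeffs = torch.rand(k) * 5.0               # random positive latent values
synthetic = b_dec + (coeffs[:, None] * W_dec[idx]).sum(dim=0)
print(synthetic.shape)                     # torch.Size([768])
```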

You can remove GPT2's LayerNorm by fine-tuning

1 code implementation · 6 Sep 2024 · Stefan Heimersheim

The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.

Dataset: HellaSwag
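
A crude sketch of the end state, using the Hugging Face GPT-2 implementation (an assumption, not the paper's code): every LayerNorm is swapped for an identity, after which the model must be fine-tuned to recover its behaviour. The paper's actual recipe and schedule differ; this only shows where the LN modules live.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel  # assumed: Hugging Face transformers

model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has two LayerNorms per block (ln_1, ln_2) plus a final ln_f.
# Replacing them with identities removes the nonlinearity that complicates
# mechanistic analysis, but degrades the model until it is fine-tuned again.
for block in model.transformer.h:
    block.ln_1 = nn.Identity()
    block.ln_2 = nn.Identity()
model.transformer.ln_f = nn.Identity()

# ... at this point one would fine-tune `model` so that it learns to work
# without LayerNorm.
```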

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

1 code implementation · 17 May 2024 · Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB).
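
The generic operation underlying such a method is a change of basis on the activations, compensated for in the adjacent weights so that the network's function is unchanged; LIB's contribution is the particular choice of basis. A small sketch of that generic step only (not the LIB construction itself):

```python
import torch

torch.manual_seed(0)
d = 8

# Two linear layers: y = W2 @ (W1 @ x)
W1 = torch.randn(d, d, dtype=torch.float64)
W2 = torch.randn(d, d, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)
y = W2 @ (W1 @ x)

# Express the intermediate activation in a new basis M and fold M^{-1} into
# the next layer; the network computes the same function in new coordinates.
M = torch.randn(d, d, dtype=torch.float64)     # any invertible matrix
act_new_basis = M @ (W1 @ x)                   # activation in the new basis
W2_new_basis = W2 @ torch.linalg.inv(M)        # compensated next-layer weights
y_new = W2_new_basis @ act_new_basis

print(torch.allclose(y, y_new))                # True: function preserved
```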

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

no code implementations · 17 May 2024 · Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable. We also provide some evidence that such a representation is likely to have sparser interactions.

Task: Learning Theory
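
A concrete example of the kind of degeneracy in question is the ReLU rescaling symmetry: scaling a hidden neuron's input weights up and its output weights down leaves the network's function, and hence the loss, unchanged. A small generic check (illustrative, not code from the paper):

```python
import torch

torch.manual_seed(0)

# Tiny ReLU MLP: y = W2 @ relu(W1 @ x)
W1 = torch.randn(16, 8)
W2 = torch.randn(4, 16)
x = torch.randn(8)
y = W2 @ torch.relu(W1 @ x)

# Rescaling degeneracy: relu(c*z) = c*relu(z) for c > 0, so scaling each hidden
# neuron's input weights by c and its output weights by 1/c is a
# reparameterization that leaves the function (and the loss) unchanged.
c = torch.rand(16) + 0.5
W1_scaled = c[:, None] * W1
W2_scaled = W2 / c[None, :]
y_scaled = W2_scaled @ torch.relu(W1_scaled @ x)

print(torch.allclose(y, y_scaled, atol=1e-5))  # True
```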

How to use and interpret activation patching

no code implementations · 23 Apr 2024 · Stefan Heimersheim, Neel Nanda

Activation patching is a popular mechanistic interpretability technique, but it has many subtleties regarding how it is applied and how one may interpret the results.
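
A minimal example of the basic operation, using TransformerLens (illustrative only; the paper's guidance is about how to choose and interpret such patches, not this specific snippet): run the model on a corrupted prompt while overwriting one layer's residual stream with its value from a clean prompt, and compare the logits on the answer token.

```python
import torch
from transformer_lens import HookedTransformer  # assumed tooling

model = HookedTransformer.from_pretrained("gpt2", device="cpu")

clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")

# Cache activations from the clean run
_, clean_cache = model.run_with_cache(clean)

def patch_resid(resid, hook):
    # Overwrite the corrupted run's residual stream with the clean run's value
    return clean_cache[hook.name]

layer = 8
patched_logits = model.run_with_hooks(
    corrupt, fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", patch_resid)]
)
corrupt_logits = model(corrupt)

print("corrupted logit for ' Mary':", corrupt_logits[0, -1, answer].item())
print("patched   logit for ' Mary':", patched_logits[0, -1, answer].item())
```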

Towards Automated Circuit Discovery for Mechanistic Interpretability

4 code implementations · NeurIPS 2023 · Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
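
The elementary measurement such automated methods build on is: ablate one component and see how much the output distribution moves. ACDC automates this over the whole computational graph, pruning edges whose removal changes the metric by less than a threshold. The sketch below shows a single such step with zero-ablation of one attention head for simplicity (ACDC itself patches in corrupted-input activations rather than zeros):

```python
import torch
from transformer_lens import HookedTransformer  # assumed tooling

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")

clean_logprobs = model(tokens)[0, -1].log_softmax(-1)

def zero_head(z, hook, head=9):
    # z: [batch, pos, n_heads, d_head]; knock out one head's output
    z[:, :, head, :] = 0.0
    return z

ablated_logprobs = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.9.attn.hook_z", zero_head)]
)[0, -1].log_softmax(-1)

# KL divergence is the kind of metric ACDC thresholds when deciding whether a
# component matters for the behaviour being studied.
kl = (clean_logprobs.exp() * (clean_logprobs - ablated_logprobs)).sum()
print(f"KL after zero-ablating head L9H9: {kl.item():.4f}")
```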
