Search Results for author: Stefan Heimersheim

Found 2 papers, 1 paper with code

How to use and interpret activation patching

no code implementations 23 Apr 2024 Stefan Heimersheim, Neel Nanda

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.

Towards Automated Circuit Discovery for Mechanistic Interpretability

2 code implementations NeurIPS 2023 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

The ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
