1 code implementation • 23 May 2025 • Patrick Leask, Neel Nanda, Noura Al Moubayed
To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary.
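A minimal sketch of that greedy construction, to make it concrete: a simple matching-pursuit reconstruction plus a loop that adds any activation the current dictionary approximates poorly. The function names, the per-activation loop, and the relative-error threshold are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of ITDA-style greedy dictionary construction (assumed details).
import torch

def matching_pursuit(x, dictionary, k):
    """Approximate x (d,) with at most k atoms from dictionary (n_atoms, d)."""
    residual = x.clone()
    recon = torch.zeros_like(x)
    for _ in range(k):
        scores = dictionary @ residual                 # correlation with each atom
        idx = scores.abs().argmax()
        coef = scores[idx] / dictionary[idx].pow(2).sum().clamp_min(1e-8)
        recon = recon + coef * dictionary[idx]
        residual = x - recon
    return recon

def build_itda_dictionary(activations, k=8, err_threshold=0.1):
    """Greedily add the activations that matching pursuit reconstructs worst."""
    dictionary = activations[:1].clone()               # seed with the first activation
    for x in activations[1:]:
        recon = matching_pursuit(x, dictionary, k)
        rel_err = (x - recon).norm() / x.norm().clamp_min(1e-8)
        if rel_err > err_threshold:                    # poorly approximated -> new atom
            dictionary = torch.cat([dictionary, x[None]], dim=0)
    return dictionary
```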
1 code implementation • 20 May 2025 • Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
no code implementations • 18 Apr 2025 • Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda
In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL).
1 code implementation • 3 Apr 2025 • Julian Minder, Clement Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning.
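One common way to set up such a crosscoder is a single set of latents decoded separately into the base and fine-tuned activation spaces, so each latent has a decoder direction in each model. The sketch below assumes that layout (concatenated encoder input, two decoders); the paper's exact architecture and training details may differ.

```python
# Illustrative crosscoder layout (assumed, not the reference implementation).
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(2 * d_model, d_latent)  # reads both models' activations
        self.dec_base = nn.Linear(d_latent, d_model)     # latent directions in the base model
        self.dec_ft = nn.Linear(d_latent, d_model)       # latent directions in the fine-tuned model

    def forward(self, act_base, act_ft):
        z = torch.relu(self.encoder(torch.cat([act_base, act_ft], dim=-1)))
        return self.dec_base(z), self.dec_ft(z), z       # reconstruct each model separately
```

Comparing a latent's two decoder directions (e.g. their relative norms) is then one way to track concepts that shift or emerge during fine-tuning.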
no code implementations • 2 Apr 2025 • Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, Anca Dragan
Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
2 code implementations • 21 Mar 2025 • Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without any incentive to absorb the high-level features.
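A hedged sketch of the nested-dictionary idea: the reconstruction loss is applied to several prefixes of the latent vector, so the early latents must reconstruct well on their own. The prefix sizes and the plain ReLU-plus-L1 objective are assumptions for illustration.

```python
# Minimal Matryoshka-style SAE loss sketch (assumed hyperparameters and objective).
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    def __init__(self, d_model, d_sae, prefix_sizes=(64, 256, 1024)):
        super().__init__()
        assert d_sae >= max(prefix_sizes)
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        self.prefix_sizes = prefix_sizes

    def loss(self, x, l1_coeff=1e-3):
        z = torch.relu(self.enc(x))                      # sparse latents
        total = l1_coeff * z.abs().sum(-1).mean()        # shared sparsity penalty
        for m in self.prefix_sizes:                      # nested dictionaries
            mask = torch.zeros_like(z)
            mask[..., :m] = 1.0                          # keep only the first m latents
            recon = self.dec(z * mask)
            total = total + (recon - x).pow(2).mean()    # each prefix must reconstruct x
        return total
```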
no code implementations • 12 Mar 2025 • Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning.
1 code implementation • 11 Mar 2025 • Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities.
1 code implementation • 23 Feb 2025 • Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations.
no code implementations • 7 Feb 2025 • Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic.
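The meta-SAE construction can be sketched as follows: treat each decoder direction of a large SAE as a data point and train a smaller SAE on those directions. The training loop, normalization, and hyperparameters below are assumptions, not the paper's exact setup.

```python
# Hedged meta-SAE sketch: an SAE trained on another SAE's decoder rows.
import torch
import torch.nn as nn

def train_meta_sae(decoder_weight, d_meta=512, steps=1000, l1_coeff=1e-3, lr=1e-3):
    # decoder_weight: (n_latents, d_model); rows are the big SAE's decoder directions
    data = nn.functional.normalize(decoder_weight.detach(), dim=-1)
    d_model = data.shape[1]
    enc = nn.Linear(d_model, d_meta)
    dec = nn.Linear(d_meta, d_model)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        z = torch.relu(enc(data))                        # meta-latents
        recon = dec(z)
        loss = (recon - data).pow(2).mean() + l1_coeff * z.abs().sum(-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc, dec   # meta-latents decompose the big SAE's latent directions
```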
no code implementations • 27 Jan 2025 • Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals.
2 code implementations • 9 Dec 2024 • Bart Bussmann, Patrick Leask, Neel Nanda
A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations.
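The TopK activation itself is simple; a minimal sketch is below. Keeping only non-negative values via the ReLU is a simplification assumed here for illustration.

```python
# Sketch of a per-sample TopK activation for an SAE encoder.
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per sample and zero out the rest."""
    values, indices = pre_acts.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    return sparse.scatter(-1, indices, torch.relu(values))
```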
no code implementations • 28 Nov 2024 • Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units.
no code implementations • 21 Nov 2024 • Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem.
2 code implementations • 9 Aug 2024 • Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison.
1 code implementation • 19 Jul 2024 • Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension.
1 code implementation • 25 Jun 2024 • Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream.
1 code implementation • 24 Jun 2024 • Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored.
1 code implementation • 17 Jun 2024 • Jacob Dunefsky, Philippe Chlenski, Neel Nanda
We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers.
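A transcoder can be sketched as an SAE-like model trained to map an MLP sublayer's input to that sublayer's output through sparse latents, so the sublayer's computation can be read off the (linear) encoder and decoder weights. The class below is a hedged illustration; names and the objective are assumptions.

```python
# Hedged transcoder sketch: sparse, linear stand-in for an MLP sublayer.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, mlp_in):
        latents = torch.relu(self.encoder(mlp_in))       # sparse latents
        return self.decoder(latents), latents            # trained to match the MLP's output

def transcoder_loss(mlp_out, pred, latents, l1_coeff=1e-3):
    return (pred - mlp_out).pow(2).mean() + l1_coeff * latents.abs().sum(-1).mean()
```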
1 code implementation • 17 Jun 2024 • Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
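In the spirit of that one-dimensional result, removing the component of an activation along a single "refusal direction" is a simple projection; a minimal sketch is below. Where the direction comes from and which activations to edit are outside the scope of this sketch.

```python
# Sketch of ablating a single direction from residual-stream activations.
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of acts (..., d_model) along `direction` (d_model,)."""
    d = direction / direction.norm()
    return acts - (acts @ d)[..., None] * d              # project onto the orthogonal complement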
no code implementations • 14 May 2024 • Aleksandar Makelov, Georg Lange, Neel Nanda
Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features).
2 code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of LM activations.
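A minimal sketch of such a sparse, linear reconstruction: a ReLU encoder producing non-negative latents, a linear decoder, and an L1 penalty encouraging sparsity. The hyperparameters and the plain L1 objective are illustrative assumptions, not any particular paper's training recipe.

```python
# Minimal standard-SAE sketch (assumed ReLU + L1 objective).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))    # sparse, non-negative feature activations
        recon = self.decoder(latents)            # linear reconstruction of the LM activation
        return recon, latents

def sae_loss(x, recon, latents, l1_coeff=1e-3):
    return (recon - x).pow(2).mean() + l1_coeff * latents.abs().sum(-1).mean()
```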
no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda
Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.
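For readers unfamiliar with the technique: activation patching caches an activation from a "clean" run and splices it into a "corrupted" run, then measures how the output changes. The sketch below uses plain PyTorch forward hooks; `model` and `layer` are placeholders, and the inputs are assumed to have matching shapes.

```python
# Sketch of activation patching with forward hooks (placeholder model/layer).
import torch

def activation_patch(model, layer, clean_input, corrupted_input):
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()           # record the clean activation

    def patch_hook(module, inputs, output):
        return cache["act"]                      # overwrite with the clean activation

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)                       # clean run: populate the cache
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(corrupted_input)     # corrupted run with the clean activation
    handle.remove()
    return patched_out
```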
no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP that lead to significant false negatives.
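The AtP approximation replaces a re-run per patch with a first-order Taylor estimate: the gradient of the metric with respect to an activation, dotted with the clean-minus-corrupted activation difference. The sketch below assumes the metric was computed from a run in which the corrupted activation remains in the autograd graph; names and the calling convention are illustrative.

```python
# Hedged attribution-patching estimate (first-order approximation).
import torch

def attribution_patching_estimate(metric_on_corrupted, corrupted_act, clean_act):
    """Estimate each position's patch effect as grad(metric) . (clean - corrupted)."""
    (grad,) = torch.autograd.grad(metric_on_corrupted, corrupted_act, retain_graph=True)
    return (grad * (clean_act - corrupted_act)).sum(dim=-1)   # per-position estimate
```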
1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda
Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, when components in large language models are ablated, later components change their behavior to compensate.
no code implementations • 11 Feb 2024 • Bilal Chughtai, Alan Cooney, Neel Nanda
How do transformer-based large language models (LLMs) store and retrieve knowledge?
1 code implementation • 22 Jan 2024 • Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
In other words, are neural mechanisms universal across different models?
1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda
We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.
1 code implementation • 1 Nov 2023 • Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda
We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active.
1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).
1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.
no code implementations • 27 Sep 2023 • Fred Zhang, Neel Nanda
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step.
2 code implementations • 2 Sep 2023 • Neel Nanda, Andrew Lee, Martin Wattenberg
How do sequence models represent their decision-making process?
no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models.
1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.
2 code implementations • 2 May 2023 • Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood.
no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.
1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
no code implementations • 24 Sep 2022 • Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah
In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices).
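A common diagnostic for induction heads (assumed here as an illustration, not the paper's exact procedure) is to feed a repeated random token sequence and score each head by how much it attends from each position back to the token that followed the previous occurrence of the current token.

```python
# Hedged induction-head diagnostic on a doubled random sequence.
import torch

def induction_score(attn_pattern: torch.Tensor, seq_len: int) -> torch.Tensor:
    """attn_pattern: (n_heads, 2*seq_len, 2*seq_len) attention on a repeated sequence."""
    offset = seq_len - 1
    # attention weight from each query position t to key position t - (seq_len - 1)
    diag = torch.diagonal(attn_pattern, offset=-offset, dim1=-2, dim2=-1)
    return diag[..., 1:].mean(dim=-1)   # average over queries in the repeated second half
```

Heads with a high score attend almost entirely to the "what came after this token last time" position, which is the behavior the induction-head hypothesis describes.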
4 code implementations • 12 Apr 2022 • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.
no code implementations • 4 Oct 2021 • Neel Nanda, Jonathan Uesato, Sven Gowal
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels.
no code implementations • 17 Feb 2021 • Michael K. Cohen, Marcus Hutter, Neel Nanda
If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time.