Search Results for author: Neel Nanda

Found 43 papers, 26 papers with code

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

1 code implementation • 23 May 2025 • Patrick Leask, Neel Nanda, Noura Al Moubayed

To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary.

Language Modelling
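A minimal sketch of the greedy construction described above, assuming unit-norm atoms and a simple relative-error criterion for adding new atoms (function names, `k`, and `tol` are illustrative, not the paper's implementation):

```python
import numpy as np

def matching_pursuit(x, atoms, k=8):
    """Greedy sparse approximation of x using at most k unit-norm atoms."""
    residual = x.copy()
    recon = np.zeros_like(x)
    for _ in range(min(k, len(atoms))):
        scores = atoms @ residual              # correlation with each atom
        i = np.argmax(np.abs(scores))
        recon += scores[i] * atoms[i]
        residual -= scores[i] * atoms[i]
    return recon, residual

def build_itda_dictionary(activations, k=8, tol=0.1):
    """Greedily grow a dictionary from the activations that matching pursuit
    reconstructs worst (hypothetical simplification of the ITDA procedure)."""
    atoms = [activations[0] / np.linalg.norm(activations[0])]
    for x in activations[1:]:
        _, residual = matching_pursuit(x, np.stack(atoms), k)
        # Add the activation itself as a new atom if it is poorly approximated.
        if np.linalg.norm(residual) / np.linalg.norm(x) > tol:
            atoms.append(x / np.linalg.norm(x))
    return np.stack(atoms)
```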

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

1 code implementation • 20 May 2025 • Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda

This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.

Scaling sparse feature circuit finding for in-context learning

no code implementations • 18 Apr 2025 • Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda

In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL).

In-Context Learning Large Language Model

Robustly identifying concepts introduced during chat fine-tuning using crosscoders

1 code implementation • 3 Apr 2025 • Julian Minder, Clement Dumas, Caden Juang, Bilal Chughtai, Neel Nanda

Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning.

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

2 code implementations • 21 Mar 2025 • Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda

This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features.
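A rough sketch of the nested-dictionary idea, assuming the reconstruction loss is summed over several prefixes of the latent dictionary (prefix sizes and variable names here are illustrative, not the paper's configuration):

```python
import torch

def matryoshka_reconstruction_loss(x, latents, W_dec, b_dec, prefix_sizes=(64, 512, 4096)):
    """Sum of reconstruction losses over nested prefixes of the dictionary, so
    the earliest latents must capture general structure on their own (sketch)."""
    loss = 0.0
    for m in prefix_sizes:
        x_hat = latents[:, :m] @ W_dec[:m] + b_dec   # reconstruct with the first m latents only
        loss = loss + (x - x_hat).pow(2).mean()
    return loss
```

Because the smaller prefixes must reconstruct well on their own, they are pushed toward general features, while the remaining latents refine the details.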

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

no code implementations • 12 Mar 2025 • Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda

We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning.

Disentanglement Language Modeling +1

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

1 code implementation • 11 Mar 2025 • Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities.

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

1 code implementation • 23 Feb 2025 • Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations.

Inductive Bias Large Language Model
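For orientation, the generic sparse autoencoder that many of the papers on this page build on can be sketched as follows: a simplified ReLU + L1 variant, with layer sizes and the loss coefficient chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: reconstruct an activation vector as a sparse, non-negative
    combination of learned dictionary directions (generic sketch)."""
    def __init__(self, d_model, n_latents):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_latents)
        self.W_dec = nn.Linear(n_latents, d_model)

    def forward(self, x):
        latents = torch.relu(self.W_enc(x))   # sparse feature activations
        recon = self.W_dec(latents)           # reconstruction of x
        return recon, latents

def sae_loss(x, recon, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    return (x - recon).pow(2).mean() + l1_coeff * latents.abs().sum(-1).mean()
```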

Sparse Autoencoders Do Not Find Canonical Units of Analysis

no code implementations • 7 Feb 2025 • Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda

Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic.
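A hedged sketch of the meta-SAE setup described above, reusing the generic SparseAutoencoder sketch from the previous entry and treating the larger SAE's decoder rows as the training data (all shapes and sizes are made up for illustration):

```python
import torch

# Decoder matrix of an existing, larger SAE: one row per latent direction.
# (Random stand-in data here; in practice this comes from a trained SAE.)
big_sae_W_dec = torch.randn(16384, 768)

# A meta-SAE is simply a smaller sparse autoencoder trained on those rows, so
# each large-SAE latent gets a sparse decomposition into meta-latents.
meta_sae = SparseAutoencoder(d_model=768, n_latents=1024)
recon, meta_codes = meta_sae(big_sae_W_dec)   # meta_codes: (16384, 1024) sparse codes
```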

BatchTopK Sparse Autoencoders

2 code implementations • 9 Dec 2024 • Bart Bussmann, Patrick Leask, Neel Nanda

A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations.

Language Modelling
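A sketch of the difference between per-sample TopK and a batch-level variant, under the assumption that sparsification is applied to the encoder pre-activations (names and the scatter-based implementation are illustrative):

```python
import torch

def topk_activations(pre_acts, k):
    """TopK-style sparsification: keep the k largest pre-activations per sample,
    zero the rest (sketch)."""
    values, indices = pre_acts.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    return sparse.scatter(-1, indices, values)

def batchtopk_activations(pre_acts, k):
    """BatchTopK-style variant: keep the k * batch_size largest pre-activations
    across the whole batch, so per-sample sparsity can vary (sketch)."""
    batch, n = pre_acts.shape
    flat = pre_acts.flatten()
    values, indices = flat.topk(k * batch)
    sparse = torch.zeros_like(flat)
    return sparse.scatter(0, indices, values).view(batch, n)
```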

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

no code implementations • 28 Nov 2024 • Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda

Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units.

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

no code implementations • 21 Nov 2024 • Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda

Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem.

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

2 code implementations • 9 Aug 2024 • Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison.


Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

1 code implementation • 19 Jul 2024 • Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension.
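The JumpReLU activation at the heart of this paper can be sketched in one line: a pre-activation passes through unchanged only when it clears a (typically learned, per-latent) threshold. The straight-through gradient estimators used to train the thresholds are omitted here:

```python
import torch

def jumprelu(pre_acts, theta):
    """JumpReLU: keep pre-activations that exceed the threshold theta, output
    zero otherwise (sketch of the activation function only)."""
    return pre_acts * (pre_acts > theta)
```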

Interpreting Attention Layer Outputs with Sparse Autoencoders

1 code implementation • 25 Jun 2024 • Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream.

Confidence Regulation Neurons in Language Models

1 code implementation • 24 Jun 2024 • Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored.

Transcoders Find Interpretable LLM Feature Circuits

1 code implementation • 17 Jun 2024 • Jacob Dunefsky, Philippe Chlenski, Neel Nanda

We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers.
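A simplified sketch of a transcoder: unlike an SAE, which reconstructs its own input, a transcoder is trained to map an MLP sublayer's input to that sublayer's output through a sparse bottleneck (layer sizes, training loss, and names are illustrative):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse module trained to approximate an MLP sublayer's input-to-output
    map, giving an interpretable stand-in for the MLP (simplified sketch)."""
    def __init__(self, d_model, n_latents):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)

    def forward(self, mlp_input):
        latents = torch.relu(self.enc(mlp_input))   # sparse features of the MLP input
        return self.dec(latents), latents           # approximation of the MLP output
```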

Refusal in Language Models Is Mediated by a Single Direction

1 code implementation • 17 Jun 2024 • Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.

Instruction Following
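The intervention this finding suggests is directional ablation: remove the component of the residual-stream activations along the refusal direction. A minimal sketch, assuming the direction has already been estimated (e.g. from a difference of means between harmful and harmless prompts); this is not the paper's exact code:

```python
import torch

def ablate_direction(activations, direction):
    """Project out a single unit direction (e.g. a candidate refusal direction)
    from activation vectors of shape (..., d_model)."""
    d = direction / direction.norm()
    return activations - (activations @ d).unsqueeze(-1) * d
```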

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

no code implementations • 14 May 2024 • Aleksandar Makelov, Georg Lange, Neel Nanda

Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features).

Dictionary Learning

Improving Dictionary Learning with Gated Sparse Autoencoders

2 code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

Dictionary Learning
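A simplified sketch of the gated encoder idea: one linear path decides which latents are active, a second estimates their magnitudes, and the two are combined multiplicatively. The weight tying and auxiliary loss used in the paper are omitted, and names here are illustrative:

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Gated SAE sketch: separate detection (which latents fire) from
    magnitude estimation (how strongly they fire)."""
    def __init__(self, d_model, n_latents):
        super().__init__()
        self.W_gate = nn.Linear(d_model, n_latents)
        self.W_mag = nn.Linear(d_model, n_latents)
        self.W_dec = nn.Linear(n_latents, d_model)

    def forward(self, x):
        gate = (self.W_gate(x) > 0).float()   # binary: is the feature active?
        mag = torch.relu(self.W_mag(x))       # non-negative magnitude estimate
        latents = gate * mag
        return self.W_dec(latents), latents
```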

How to use and interpret activation patching

no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results.
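For orientation, the core loop of activation patching can be sketched with generic PyTorch forward hooks: cache an activation from a clean run, overwrite the corresponding activation in a corrupted run, and compare some behavioral metric. Here `model`, `layer`, and `metric` are placeholders rather than a specific library API:

```python
import torch

def activation_patching_effect(model, layer, clean_inputs, corrupt_inputs, metric):
    """Patch one layer's clean-run activation into the corrupted run and measure
    how much of the clean behavior is restored (generic sketch)."""
    cached = {}

    def cache_hook(module, inputs, output):
        cached["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cached["act"]   # overwrite the corrupted activation

    with torch.no_grad():
        h = layer.register_forward_hook(cache_hook)
        model(clean_inputs)                   # clean run: cache the activation
        h.remove()

        h = layer.register_forward_hook(patch_hook)
        patched_out = model(corrupt_inputs)   # corrupted run with clean activation patched in
        h.remove()

        corrupt_out = model(corrupt_inputs)   # baseline corrupted run
    return metric(patched_out) - metric(corrupt_out)
```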

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.
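The approximation AtP builds on can be stated in one line: the effect of patching is estimated to first order as the dot product between the activation difference and the gradient of the metric with respect to that activation, taken on the corrupted run. A minimal sketch (tensor names are illustrative, not the paper's code):

```python
import torch

def attribution_patching_estimate(clean_act, corrupt_act, corrupt_metric_grad):
    """First-order estimate of the effect of patching the clean activation into
    the corrupted run: (a_clean - a_corrupt) . dMetric/da (sketch)."""
    return ((clean_act - corrupt_act) * corrupt_metric_grad).sum()
```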

Explorations of Self-Repair in Language Models

1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda

Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where, if components in large language models are ablated, later components will change their behavior to compensate.

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

no code implementations • 11 Feb 2024 • Bilal Chughtai, Alan Cooney, Neel Nanda

How do transformer-based large language models (LLMs) store and retrieve knowledge?

Attribute

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda

We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.

Attribute

Training Dynamics of Contextual N-Grams in Language Models

1 code implementation • 1 Nov 2023 • Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active.

Linear Representations of Sentiment in Large Language Models

1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).

Zero-Shot Learning

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

no code implementations • 27 Sep 2023 • Fred Zhang, Neel Nanda

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step.

Neuron to Graph: Interpreting Language Model Neurons at Scale

1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.

Language Modelling

Finding Neurons in a Haystack: Case Studies with Sparse Probing

2 code implementations • 2 May 2023 • Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood.

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda

Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.

Progress measures for grokking via mechanistic interpretability

1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization

In-context Learning and Induction Heads

no code implementations • 24 Sep 2022 • Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah

In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices).

In-Context Learning
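As a toy illustration of the attention pattern an idealized induction head exhibits, the position it should attend to at each token is one after the most recent earlier occurrence of the same token; a head can then be scored by how much attention mass it places on these targets. A sketch of computing the targets (purely illustrative, not the paper's analysis code):

```python
import numpy as np

def induction_targets(tokens):
    """For each position, the position an idealized induction head would attend
    to: one after the most recent earlier occurrence of the current token."""
    last_seen = {}
    targets = np.full(len(tokens), -1)
    for i, t in enumerate(tokens):
        if t in last_seen:
            targets[i] = last_seen[t] + 1
        last_seen[t] = i
    return targets
```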

An Empirical Investigation of Learning from Biased Toxicity Labels

no code implementations • 4 Oct 2021 • Neel Nanda, Jonathan Uesato, Sven Gowal

Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels.

Fairness

Fully General Online Imitation Learning

no code implementations • 17 Feb 2021 • Michael K. Cohen, Marcus Hutter, Neel Nanda

If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time.

Imitation Learning
