Search Results for author: Neel Nanda

Found 21 papers, 12 papers with code

How to use and interpret activation patching

no code implementations23 Apr 2024 Stefan Heimersheim, Neel Nanda

Activation patching is a popular mechanistic interpretability technique, but it has many subtleties regarding how it is applied and how the results may be interpreted.
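For orientation, here is a minimal sketch of the technique itself: cache an activation from a clean run, then substitute it into a corrupted run and see how much of the clean behaviour is restored. The toy model, layer sizes, and hook names below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of activation patching with PyTorch forward hooks.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = TinyModel()
clean_input = torch.randn(1, 8)
corrupt_input = torch.randn(1, 8)

# 1) Cache the activation of interest on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["layer1"] = output.detach()

h = model.layer1.register_forward_hook(save_hook)
clean_logits = model(clean_input)
h.remove()

# 2) Re-run on the corrupted input, patching in the clean activation
#    (returning a tensor from a forward hook replaces the module's output).
def patch_hook(module, inputs, output):
    return cache["layer1"]

h = model.layer1.register_forward_hook(patch_hook)
patched_logits = model(corrupt_input)
h.remove()

corrupt_logits = model(corrupt_input)

# 3) If patching layer1 restores the clean behaviour, that component carries
#    the information distinguishing the two inputs.
print("clean:  ", clean_logits)
print("corrupt:", corrupt_logits)
print("patched:", patched_logits)
```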

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations1 Mar 2024 János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.
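The core idea of AtP is a first-order Taylor estimate: instead of re-running the model once per patched component, approximate the effect of a patch as the gradient of the metric with respect to the activation, dotted with the activation difference. A hedged sketch on a toy model (not the paper's implementation) follows.

```python
# Sketch of Attribution Patching (AtP): estimate the effect of patching an
# activation with one backward pass, rather than one forward pass per patch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

acts = {}
def save_hook(module, inputs, output):
    acts[module] = output  # keep in the autograd graph on the clean run

layer = model[0]
h = layer.register_forward_hook(save_hook)

# Clean run: keep the activation in the graph so the metric is differentiable.
clean_metric = model(clean_x).sum()
clean_act = acts[layer]

# Corrupted run: only the activation values are needed.
with torch.no_grad():
    model(corrupt_x)
corrupt_act = acts[layer]
h.remove()

# AtP estimate of patching corrupt -> clean activations:
# delta_metric ≈ (dmetric/dact) · (corrupt_act - clean_act)
(grad,) = torch.autograd.grad(clean_metric, clean_act)
atp_estimate = (grad * (corrupt_act - clean_act.detach())).sum()
print("AtP estimate of patching effect:", atp_estimate.item())
```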

Explorations of Self-Repair in Language Models

1 code implementation23 Feb 2024 Cody Rushing, Neel Nanda

Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where, if components in large language models are ablated, later components change their behavior to compensate.
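The experimental primitive here is ablation: zero out one component's output and compare the model's behaviour before and after. The sketch below shows that methodology on a toy model (illustrative assumptions throughout); whether later components actually compensate is an empirical question about the trained model being studied.

```python
# Sketch of the ablation methodology behind self-repair studies.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(4, 8)

def zero_ablate_hook(module, inputs, output):
    return torch.zeros_like(output)  # ablate this component entirely

baseline = model(x)

h = model[0].register_forward_hook(zero_ablate_hook)
ablated = model(x)
h.remove()

# Self-repair would show up as the final output changing *less* than the
# ablated component's direct contribution predicts, because downstream
# components shift to compensate for the ablation.
print("output change under ablation:", (ablated - baseline).norm().item())
```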

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

no code implementations11 Feb 2024 Bilal Chughtai, Alan Cooney, Neel Nanda

How do transformer-based large language models (LLMs) store and retrieve knowledge?

Attribute

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

1 code implementation28 Nov 2023 Aleksandar Makelov, Georg Lange, Neel Nanda

We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.

Attribute
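To make the object of study concrete: subspace activation patching replaces only the component of an activation lying in a chosen subspace, leaving the orthogonal complement untouched. The illusion the paper describes is that such a patch can look causal even when the model does not use that direction. A minimal single-direction sketch, with toy tensors standing in for real activations:

```python
# Sketch of single-direction (1-D subspace) activation patching.
import torch

torch.manual_seed(0)
d_model = 16
direction = torch.randn(d_model)
direction = direction / direction.norm()  # unit vector spanning the subspace

act_clean = torch.randn(d_model)
act_corrupt = torch.randn(d_model)

# Swap the clean coefficient along `direction` into the corrupted activation.
coef_clean = act_clean @ direction
coef_corrupt = act_corrupt @ direction
act_patched = act_corrupt + (coef_clean - coef_corrupt) * direction

# Everything orthogonal to `direction` is unchanged:
resid = act_patched - act_corrupt
print(torch.allclose(resid, (resid @ direction) * direction, atol=1e-6))
```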

Training Dynamics of Contextual N-Grams in Language Models

1 code implementation1 Nov 2023 Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active.

Linear Representations of Sentiment in Large Language Models

1 code implementation23 Oct 2023 Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).

Zero-Shot Learning
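One common way to probe for a linear representation, in the spirit of this line of work, is a difference-of-means direction between activations on positive and negative inputs. The sketch below uses random data as a stand-in for real model activations and is not the paper's exact pipeline.

```python
# Hedged sketch: a difference-of-means "sentiment direction".
import torch

torch.manual_seed(0)
d_model = 32
pos_acts = torch.randn(100, d_model) + 0.5   # activations on positive text
neg_acts = torch.randn(100, d_model) - 0.5   # activations on negative text

direction = pos_acts.mean(0) - neg_acts.mean(0)
direction = direction / direction.norm()

# Project a held-out activation onto the direction to read off sentiment.
test_act = torch.randn(d_model) + 0.5
score = test_act @ direction
print("sentiment score (positive if > 0 here):", score.item())
```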

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation6 Oct 2023 Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

no code implementations27 Sep 2023 Fred Zhang, Neel Nanda

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step.
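A central choice the paper examines is the patching metric. A standard option is the logit difference between a correct and an incorrect answer token, normalized so the corrupted run scores 0 and the clean run scores 1. The token ids and logits below are illustrative placeholders.

```python
# Sketch of a normalized logit-difference patching metric.
import torch

def logit_diff(logits: torch.Tensor, correct_id: int, wrong_id: int) -> torch.Tensor:
    return logits[correct_id] - logits[wrong_id]

clean_logits = torch.tensor([2.0, 0.5, -1.0])
corrupt_logits = torch.tensor([0.2, 1.5, -1.0])
patched_logits = torch.tensor([1.4, 0.9, -1.0])

clean = logit_diff(clean_logits, 0, 1)
corrupt = logit_diff(corrupt_logits, 0, 1)
patched = logit_diff(patched_logits, 0, 1)

# 0 => the patch recovered nothing; 1 => it fully restored clean behaviour.
normalized = (patched - corrupt) / (clean - corrupt)
print("normalized logit difference:", normalized.item())
```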

Neuron to Graph: Interpreting Language Model Neurons at Scale

1 code implementation31 May 2023 Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.

Language Modelling
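The conventional workflow the snippet describes looks roughly like the following: collect the inputs on which a given neuron fires most strongly, then inspect them by hand for a shared pattern. Toy model and dataset below are illustrative assumptions only.

```python
# Sketch of collecting max-activating examples for one neuron.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
dataset = torch.randn(500, 8)           # stand-in for tokenized text
neuron = 3                              # neuron under investigation

with torch.no_grad():
    acts = model(dataset)[:, neuron]    # one neuron's activation per example

top_vals, top_idx = acts.topk(5)
print("strongest-activating examples:", top_idx.tolist())
print("their activations:", top_vals.tolist())
```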

Finding Neurons in a Haystack: Case Studies with Sparse Probing

2 code implementations2 May 2023 Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood.
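A sparse probe in this sense is a simple classifier on activations with a sparsity penalty, so that only a few neurons receive nonzero weight. Below is a hedged sketch using L1-regularized logistic regression on random stand-in activations; it is not the paper's code, and the "feature" is planted in one neuron purely for illustration.

```python
# Hedged sketch of a sparse (L1-regularized) probe on activations.
import torch

torch.manual_seed(0)
n, d_model = 400, 64
acts = torch.randn(n, d_model)                 # stand-in activations
labels = (acts[:, 7] > 0).float()              # feature secretly in neuron 7

w = torch.zeros(d_model, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.05)
l1 = 0.01

for _ in range(500):
    opt.zero_grad()
    logits = acts @ w + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    (loss + l1 * w.abs().sum()).backward()
    opt.step()

# A sparse probe should concentrate its weight on the relevant neuron(s).
print("largest-weight neurons:", w.abs().topk(3).indices.tolist())
```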

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

no code implementations22 Apr 2023 Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

1 code implementation6 Feb 2023 Bilal Chughtai, Lawrence Chan, Neel Nanda

Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.

Progress measures for grokking via mechanistic interpretability

1 code implementation12 Jan 2023 Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization

In-context Learning and Induction Heads

no code implementations24 Sep 2022 Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah

In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e., decreasing loss at increasing token indices).

In-Context Learning
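The standard behavioural signature of an induction head: on a sequence of random tokens repeated twice, the head attends from each token in the second half back to the token *after* its first occurrence (the "prefix-matching" pattern). The sketch below computes that score on a random stand-in attention pattern, so it will score low; a strong induction head scores near 1. Illustrative only, not the paper's evaluation code.

```python
# Sketch of a prefix-matching score on a repeated random sequence.
import torch

torch.manual_seed(0)
seq = torch.randint(0, 100, (25,))
tokens = torch.cat([seq, seq])              # random sequence repeated twice
T = tokens.shape[0]

attn = torch.rand(T, T).tril()              # stand-in causal attention pattern
attn = attn / attn.sum(-1, keepdim=True)

# Attention from position i in the second half to the position just after the
# same token's first occurrence, i.e. position (i - 25 + 1).
half = T // 2
score = attn[torch.arange(half, T), torch.arange(half, T) - half + 1].mean()
print("prefix-matching score:", score.item())
```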

An Empirical Investigation of Learning from Biased Toxicity Labels

no code implementations4 Oct 2021 Neel Nanda, Jonathan Uesato, Sven Gowal

Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels.

Fairness

Fully General Online Imitation Learning

no code implementations17 Feb 2021 Michael K. Cohen, Marcus Hutter, Neel Nanda

If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time.

Imitation Learning
