Search Results for author: Arthur Conmy

Found 13 papers, 10 with code

Improving Steering Vectors by Targeting Sparse Autoencoder Features

1 code implementation • 4 Nov 2024 • Sviatoslav Chalnev, Matthew Siu, Arthur Conmy

To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties.
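For intuition, activation steering typically adds a fixed vector to a model's residual activations at inference time. Below is a minimal, hypothetical PyTorch sketch, not the paper's method (which derives its vectors from SAE features); the Linear layer stands in for a transformer sublayer, and steering_vector and alpha are illustrative:

```python
import torch

d_model = 16
layer = torch.nn.Linear(d_model, d_model)   # stand-in for a transformer sublayer
steering_vector = torch.randn(d_model)      # hypothetical direction for the target property
alpha = 4.0                                 # steering strength (illustrative)

def add_steering(module, inputs, output):
    # Shift the sublayer's output along the steering direction.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(add_steering)
x = torch.randn(2, d_model)
steered = layer(x)   # output is now shifted by alpha * steering_vector
handle.remove()
```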

Applying sparse autoencoders to unlearn knowledge in language models

1 code implementation • 25 Oct 2024 • Eoin Farrell, Yeu-Tong Lau, Arthur Conmy

We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models.
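As a rough sketch of the idea (not the paper's code), one can encode an activation with a pre-trained SAE, clamp a feature associated with the target knowledge, and decode the result back into the model. All names, shapes, and the choice of feature below are illustrative assumptions:

```python
import torch

d_model, d_sae = 16, 64
W_enc = torch.randn(d_model, d_sae)   # hypothetical pre-trained SAE weights
W_dec = torch.randn(d_sae, d_model)
b_enc = torch.zeros(d_sae)
feature_idx = 3                       # feature assumed to encode the target knowledge

def ablate_feature(act):
    f = torch.relu(act @ W_enc + b_enc)   # sparse feature activations
    f[..., feature_idx] = 0.0             # clamp the targeted feature (or set it negative)
    return f @ W_dec                      # decode back to an edited activation

edited = ablate_feature(torch.randn(2, d_model))
```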

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

2 code implementations • 9 Aug 2024 • Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison.

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

1 code implementation • 19 Jul 2024 • Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension.
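The JumpReLU activation at the heart of the paper passes a pre-activation through unchanged only when it exceeds a learnable per-feature threshold. A minimal sketch (threshold values are illustrative; training the thresholds requires straight-through estimators, which are omitted here):

```python
import torch

d_sae = 64
theta = torch.full((d_sae,), 0.5)   # per-feature thresholds (illustrative values)

def jumprelu(z, theta):
    # JumpReLU(z) = z * 1[z > theta]: keep values above the threshold, zero the rest.
    return z * (z > theta).float()

out = jumprelu(torch.randn(2, d_sae), theta)
```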

Interpreting Attention Layer Outputs with Sparse Autoencoders

1 code implementation • 25 Jun 2024 • Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream.
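For readers new to SAEs, a minimal sketch of the decomposition (widths and initialization are illustrative; z stands in for an attention layer's output activations):

```python
import torch

d_model, d_sae = 16, 64
W_enc = torch.randn(d_model, d_sae) * 0.1
W_dec = torch.randn(d_sae, d_model) * 0.1
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

z = torch.randn(8, d_model)                  # batch of attention-output activations
f = torch.relu((z - b_dec) @ W_enc + b_enc)  # sparse, non-negative feature activations
z_hat = f @ W_dec + b_dec                    # reconstruction of the original activation
loss = ((z - z_hat) ** 2).sum(-1).mean() + 1e-3 * f.abs().sum(-1).mean()  # L2 + L1 sparsity
```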

Improving Dictionary Learning with Gated Sparse Autoencoders

1 code implementation • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

Dictionary Learning
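The gated SAE of this paper separates which features fire from how strongly they fire, tying the weights of the two paths to limit parameter count and reduce shrinkage bias. A hedged sketch of the gated encoder (shapes and values are assumptions):

```python
import torch

d_model, d_sae = 16, 64
W_gate = torch.randn(d_model, d_sae) * 0.1
r_mag = torch.zeros(d_sae)      # per-feature log-rescaling shared with the gate weights
b_gate = torch.zeros(d_sae)
b_mag = torch.zeros(d_sae)

def gated_encode(x):
    gate = ((x @ W_gate + b_gate) > 0).float()             # which features are active
    mag = torch.relu(x @ (W_gate * r_mag.exp()) + b_mag)   # how strongly (tied weights)
    return gate * mag

f = gated_encode(torch.randn(2, d_model))
```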

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

no code implementations • 14 Dec 2023 • Rhys Gould, Euan Ong, George Ogden, Arthur Conmy

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days.

Language Modelling
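Concretely, the input-to-output behaviour a successor head implements can be pictured as a lookup over ordered vocabularies (illustrative only; the real mechanism lives inside attention weights):

```python
# Illustrative behaviour of a successor head: increment along a natural ordering.
succession = {"one": "two", "Monday": "Tuesday", "January": "February"}
assert succession["Monday"] == "Tuesday"
```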

Attribution Patching Outperforms Automated Circuit Discovery

4 code implementations • 16 Oct 2023 • Aaquib Syed, Can Rager, Arthur Conmy

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models.
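Attribution patching approximates the effect of activation patching with a first-order Taylor expansion, so one clean forward pass, one corrupted forward pass, and one backward pass score every candidate component at once. A hedged numeric sketch (the quadratic metric is a stand-in for a real task metric):

```python
import torch

a_clean = torch.randn(16, requires_grad=True)   # clean-run activation
a_corrupt = torch.randn(16)                     # corrupted-run activation
metric = (a_clean ** 2).sum()                   # stand-in for the task metric
metric.backward()                               # gives d(metric)/d(a_clean)
# delta_metric ~= (a_corrupt - a_clean) . grad, with no extra forward passes
attribution = ((a_corrupt - a_clean.detach()) * a_clean.grad).sum()
```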

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

Towards Automated Circuit Discovery for Mechanistic Interpretability

4 code implementations NeurIPS 2023 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
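At its core, ACDC is a greedy sweep over the model's computational graph: ablate each edge and prune it if the task metric barely moves. A schematic sketch of that loop (compute_metric_without and tau are hypothetical stand-ins, not the paper's API):

```python
def discover_circuit(edges, compute_metric_without, base_metric, tau=0.01):
    """Greedy ACDC-style pruning sketch: keep only edges that matter."""
    kept = set(edges)
    for edge in list(edges):    # in practice, swept from outputs back to inputs
        if abs(base_metric - compute_metric_without(edge, kept)) < tau:
            kept.discard(edge)  # removing this edge barely changes the metric: prune
    return kept
```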

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

6 code implementations • 1 Nov 2022 • Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.

Language Modelling

StyleGAN-induced data-driven regularization for inverse problems

no code implementations • 7 Oct 2021 • Arthur Conmy, Subhadip Mukherjee, Carola-Bibiane Schönlieb

Our proposed approach, which we refer to as learned Bayesian reconstruction with generative models (L-BRGM), entails joint optimization over the style-code and the input latent code, and enhances the expressive power of a pre-trained StyleGAN2 generator by allowing the style-codes to be different for different generator layers.

Image Inpainting • Image Reconstruction • +1
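A hedged sketch of the joint optimisation described above, with generator and forward_op as hypothetical stand-ins for the pre-trained StyleGAN2 generator and the inverse problem's measurement operator (shapes, the generator signature, and the prior weights are assumptions):

```python
import torch

def reconstruct(y, generator, forward_op, steps=200, lr=0.05):
    w = torch.zeros(1, 512, requires_grad=True)       # input latent code
    s = torch.zeros(1, 14, 512, requires_grad=True)   # per-layer style codes
    opt = torch.optim.Adam([w, s], lr=lr)             # joint optimisation over (w, s)
    for _ in range(steps):
        x = generator(w, styles=s)                    # hypothetical generator signature
        loss = ((forward_op(x) - y) ** 2).mean()      # data fidelity to measurements y
        loss = loss + 1e-3 * (w ** 2).mean() + 1e-3 * (s ** 2).mean()  # latent priors
        opt.zero_grad(); loss.backward(); opt.step()
    return generator(w, styles=s)
```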
