Search Results for author: Arthur Conmy

Found 6 papers, 4 papers with code

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

no code implementations • 14 Dec 2023 • Rhys Gould, Euan Ong, George Ogden, Arthur Conmy

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days.

Language Modelling

Attribution Patching Outperforms Automated Circuit Discovery

1 code implementation • 16 Oct 2023 • Aaquib Syed, Can Rager, Arthur Conmy

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models.

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

Towards Automated Circuit Discovery for Mechanistic Interpretability

2 code implementations • NeurIPS 2023 • Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

3 code implementations • 1 Nov 2022 • Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.

Language Modelling

StyleGAN-induced data-driven regularization for inverse problems

no code implementations • 7 Oct 2021 • Arthur Conmy, Subhadip Mukherjee, Carola-Bibiane Schönlieb

Our proposed approach, which we refer to as learned Bayesian reconstruction with generative models (L-BRGM), entails joint optimization over the style-code and the input latent code, and enhances the expressive power of a pre-trained StyleGAN2 generator by allowing the style-codes to be different for different generator layers.

Image Inpainting, Image Reconstruction +1
