Automated interpretability has recently attracted attention as a research direction that could scale explanations of neural network behavior to large models.
We show that self-repair is implemented by several mechanisms; one of these, copy suppression, explains 39% of the behavior in a narrow task.
For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.
Our proposed approach, which we refer to as learned Bayesian reconstruction with generative models (L-BRGM), jointly optimizes the style code and the input latent code, and enhances the expressive power of a pre-trained StyleGAN2 generator by allowing the style codes to differ across generator layers.
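To make the joint optimization concrete, the following is a minimal sketch, assuming a pre-trained StyleGAN2 generator `G` that follows the common stylegan2-ada-pytorch interface (`G.mapping(z, c)` producing per-layer style codes and `G.synthesis(ws)` rendering an image from them); the function name, loss weights, and the simple prior term are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of joint (z, per-layer style code) optimization in the spirit of L-BRGM.
# Assumption: `G` is a pre-trained StyleGAN2 generator exposing G.z_dim,
# G.mapping(z, c) -> [batch, num_ws, w_dim], and G.synthesis(ws) -> image,
# as in the stylegan2-ada-pytorch codebase. Hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def reconstruct(G, target, steps=500, lr=0.05, lambda_prior=1e-3, device="cuda"):
    """Jointly optimize the input latent z and per-layer style codes w_plus
    so that G.synthesis(w_plus) reconstructs `target` ([1, C, H, W], in G's output range)."""
    z = torch.randn(1, G.z_dim, device=device, requires_grad=True)
    # Initialize the per-layer style codes from the mapping network's output,
    # then let each layer's code be optimized independently (extended W+ space).
    with torch.no_grad():
        w_init = G.mapping(z.detach(), None)          # [1, num_ws, w_dim]
    w_plus = w_init.clone().requires_grad_(True)

    opt = torch.optim.Adam([z, w_plus], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = G.synthesis(w_plus)                    # render from per-layer codes
        rec_loss = F.mse_loss(image, target)           # data (reconstruction) term
        # Regularization tying the free per-layer codes back to the mapping of z;
        # a simple stand-in for the Bayesian prior over latents described above.
        prior_loss = F.mse_loss(w_plus, G.mapping(z, None))
        loss = rec_loss + lambda_prior * prior_loss
        loss.backward()
        opt.step()
    return z.detach(), w_plus.detach()
```

Letting each generator layer have its own style code gives the optimization more degrees of freedom than a single shared code, while the prior term keeps the per-layer codes from drifting arbitrarily far from what the mapping network can produce.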