Search Results for author: János Kramár

Found 10 papers, 5 papers with code

Improving Dictionary Learning with Gated Sparse Autoencoders

no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives.

Explaining grokking through circuit efficiency

no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.

Tracr: Compiled Transformers as a Laboratory for Interpretability

1 code implementation • NeurIPS 2023 • David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Thomas McGrath, Vladimir Mikulik

Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods.

Learning Reciprocity in Complex Sequential Social Dilemmas

no code implementations • 19 Mar 2019 • Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, Joel Z. Leibo

We analyse the resulting policies to show that the reciprocating agents are strongly influenced by their co-players' behavior.
