Search Results for author: Vikrant Varma

Found 10 papers, 3 papers with code

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

no code implementations • 22 Jan 2025 • Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate.
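A minimal sketch of the idea in the title, as I read it: instead of crediting each step with all discounted future reward (which is what lets multi-step reward hacking pay off), the agent is trained myopically on immediate reward plus a far-sighted overseer's approval. The function names and the additive combination are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Standard RL target: each step is credited with all future
    reward, which is what makes multi-step reward hacking pay off."""
    returns = torch.zeros_like(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns

def mona_targets(rewards, approvals):
    """MONA-style target (sketch): optimize only the immediate reward,
    as if gamma were 0, plus a far-sighted overseer's approval of each
    action, so multi-step strategies the overseer would not endorse are
    never directly reinforced."""
    return rewards + approvals
```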

Reinforcement Learning (RL)

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

2 code implementations • 9 Aug 2024 • Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison.
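For flavor, a sketch of running one of these SAEs over a residual-stream activation. The parameter names (W_enc, b_enc, threshold, W_dec, b_dec) follow the released checkpoints as I understand them, but treat this as an illustration rather than official loading code.

```python
import numpy as np

def apply_gemma_scope_sae(x, params):
    """Apply one Gemma Scope SAE to a residual-stream activation x
    (shape [d_model]); params is a dict of NumPy weight arrays."""
    pre_acts = x @ params["W_enc"] + params["b_enc"]
    # JumpReLU: keep a latent's raw pre-activation only if it clears a
    # learned per-latent threshold, otherwise zero it out.
    acts = pre_acts * (pre_acts > params["threshold"])
    return acts @ params["W_dec"] + params["b_dec"]
```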


Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

1 code implementation • 19 Jul 2024 • Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension.
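A sketch of the two pieces the title names: the JumpReLU activation and the fidelity-plus-sparsity objective it serves. The paper trains the discontinuous threshold and L0 terms with straight-through estimators; this sketch only states the objective, and the hyperparameters are illustrative.

```python
import torch

def jumprelu(pre_acts, log_threshold):
    """JumpReLU: pass a pre-activation through unchanged if it exceeds
    a learned per-latent threshold, otherwise output exactly zero."""
    return pre_acts * (pre_acts > log_threshold.exp())

def jumprelu_sae_loss(x, recon, acts, lam=1e-3):
    """The two objectives in tension: reconstruction fidelity plus an
    L0 sparsity penalty on the feature activations."""
    recon_err = (x - recon).pow(2).sum(-1).mean()
    l0 = (acts > 0).float().sum(-1).mean()  # no gradient as written
    return recon_err + lam * l0
```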

Improving Dictionary Learning with Gated Sparse Autoencoders

2 code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of those activations.
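A sketch of the gated encoder this paper proposes, as I read it: a binary gate path decides which features fire, while a weight-tied magnitude path estimates how strongly, decoupling feature detection from magnitude estimation. Parameter names are my own labels for the paper's components.

```python
import torch

def gated_sae_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated SAE encoder (sketch). Shapes: x [d_model],
    W_gate [d_model, d_sae], r_mag/b_gate/b_mag [d_sae]."""
    x_centered = x - b_dec
    pi_gate = x_centered @ W_gate + b_gate           # which features?
    f_gate = (pi_gate > 0).float()                   # binary gate
    W_mag = W_gate * r_mag.exp()                     # tied magnitude weights
    f_mag = torch.relu(x_centered @ W_mag + b_mag)   # how strongly?
    return f_gate * f_mag                            # sparse feature acts
```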

Dictionary Learning

Challenges with unsupervised LLM knowledge discovery

no code implementations • 15 Dec 2023 • Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent.
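The best-known method of the kind this paper stress-tests is Contrast-Consistent Search (CCS; Burns et al., 2022). A minimal sketch of its loss, assuming a probe that outputs probabilities p_pos and p_neg for a statement and its negation:

```python
import torch

def ccs_loss(p_pos, p_neg):
    """CCS objective: a truth-like feature should be negation-
    consistent and confident, with no supervision needed."""
    consistency = (p_pos - (1.0 - p_neg)).pow(2)      # p(x) ~ 1 - p(not x)
    confidence = torch.minimum(p_pos, p_neg).pow(2)   # push away from 0.5
    return (consistency + confidence).mean()
```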

Language Modelling

Explaining grokking through circuit efficiency

no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.
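A minimal modular-arithmetic setup of the kind grokking was first observed in (Power et al., 2022); the architecture and hyperparameters are illustrative, not the paper's. With heavy weight decay, such networks typically memorize first (perfect train, poor test accuracy) and only later transition to generalizing, consistent with the paper's account that the generalizing circuit wins because it achieves the same training loss with smaller weights.

```python
import torch
import torch.nn as nn

P = 97  # learn (a + b) mod P from one-third of all pairs
a = torch.arange(P).repeat_interleave(P)
b = torch.arange(P).repeat(P)
y = (a + b) % P
perm = torch.randperm(P * P)
tr = perm[: P * P // 3]

class AdderMLP(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(P, d)
        self.net = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Linear(512, P))
    def forward(self, a, b):
        return self.net(torch.cat([self.emb(a), self.emb(b)], dim=-1))

model = AdderMLP()
# Weight decay is the key knob: it penalizes the parameter-inefficient
# memorizing circuit relative to the generalizing one.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()
for step in range(20_000):  # grokking can take many steps past train convergence
    opt.zero_grad()
    loss = loss_fn(model(a[tr], b[tr]), y[tr])
    loss.backward()
    opt.step()
```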

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

no code implementations • 4 Oct 2022 • Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

However, an AI system may pursue an undesired goal even when the specification is correct, a failure mode known as goal misgeneralization.

Safe Deep RL in 3D Environments using Human Feedback

no code implementations • 20 Jan 2022 • Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike

In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors.
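A sketch of the reward-model step at the heart of ReQueST, as I read it: trajectories are synthesized from a learned dynamics model, human contractors label them, and a reward model is fitted to those labels before the agent ever acts in the real environment. All names, shapes, and the binary-label setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical reward model over 64-dim encodings of model-generated frames.
reward_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def fit_reward_model(states, labels, epochs=10):
    """states: [N, 64] encodings from dynamics-model rollouts;
    labels: 1.0 where contractors approved the behavior, else 0.0."""
    for _ in range(epochs):
        opt.zero_grad()
        loss = bce(reward_model(states).squeeze(-1), labels)
        loss.backward()
        opt.step()
```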
