Search Results for author: Vikrant Varma

Found 6 papers, 0 papers with code

Safe Deep RL in 3D Environments using Human Feedback

no code implementations · 20 Jan 2022 · Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike

In this paper we answer this question (whether ReQueST, previously evaluated only with simulated feedback in simple environments, is feasible in complex 3D environments with feedback from real humans) in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors.
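At ReQueST's core is a reward model learned from human feedback, which the agent is then trained against. Below is a minimal sketch of that reward-modelling step, assuming binary human labels over trajectories; the names (RewardModel, traj_batch, labels) and the architecture are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of reward modelling from human labels, as used in
# ReQueST-style pipelines. All names and shapes here are hypothetical.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores individual observations; a trajectory's score is the sum."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, time, obs_dim) -> one scalar logit per trajectory
        return self.net(obs).squeeze(-1).sum(dim=-1)

def train_step(model, optimizer, traj_batch, labels):
    """One gradient step on binary human labels (1.0 = desired behaviour)."""
    logits = model(traj_batch)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```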

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

no code implementations · 4 Oct 2022 · Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

However, an AI system may pursue an undesired goal even when the specification is correct; this failure mode is goal misgeneralization. A classic example: an agent trained to collect a coin that always sits at the end of a level may learn the goal "move right" rather than "collect the coin".

Explaining grokking through circuit efficiency

no code implementations · 5 Sep 2023 · Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.
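For readers unfamiliar with the phenomenon, the sketch below shows a common minimal setup in which grokking is observed (a small network trained on modular addition with weight decay, after Power et al., 2022). The architecture and hyperparameters are illustrative assumptions, not this paper's experimental setup.

```python
# Toy grokking setup: modular addition with weight decay. With training
# runs this long, train accuracy typically saturates early while test
# accuracy jumps much later. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

P = 97  # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
targets = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 2
train_idx, test_idx = perm[:n_train], perm[n_train:]

embed = nn.Embedding(P, 128)
mlp = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, P))
params = list(embed.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)  # weight decay matters

def accuracy(idx):
    with torch.no_grad():
        x = embed(pairs[idx]).flatten(1)  # concatenate the two token embeddings
        return (mlp(x).argmax(-1) == targets[idx]).float().mean().item()

for step in range(50_000):
    x = embed(pairs[train_idx]).flatten(1)
    loss = nn.functional.cross_entropy(mlp(x), targets[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, accuracy(train_idx), accuracy(test_idx))
```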

Challenges with unsupervised LLM knowledge discovery

no code implementations · 15 Dec 2023 · Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent.
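One of the methods examined in this line of work is Contrast-Consistent Search (CCS; Burns et al., 2022), which fits a probe to the activations of a statement and its negation so that the two predicted probabilities are mutually consistent and confident. A minimal sketch of that objective, with our own variable names:

```python
# Sketch of the CCS objective. A linear probe maps the activations of a
# statement and of its negation to probabilities; the loss pushes them to
# be consistent (p(x) ≈ 1 - p(not x)) and confident (not both near 0.5).
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.probe(acts).squeeze(-1)

def ccs_loss(probe, acts_pos, acts_neg):
    """acts_pos / acts_neg: activations for a statement and its negation."""
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    return consistency + confidence
```

The paper's critique is that nothing in this objective singles out knowledge: any prominent, negation-consistent feature of the activations can minimise it.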

Language Modelling · Large Language Model

Improving Dictionary Learning with Gated Sparse Autoencoders

no code implementations · 24 Apr 2024 · Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of those activations.
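As a rough illustration of the gated architecture the paper describes, the sketch below separates a gating path (which features fire) from a magnitude path (how strongly they fire), with the two encoder paths tied up to a learned per-feature rescaling. This is a paraphrase of the paper's description, not the authors' implementation.

```python
# Sketch of a gated SAE forward pass. The Heaviside gate passes no
# gradient on its own; the paper trains it with an auxiliary loss,
# which is omitted here for brevity.
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))  # log-scale tying mag to gate
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_cent = x - self.b_dec
        pre = x_cent @ self.W_enc
        gate = (pre + self.b_gate > 0).float()                 # which features fire
        mag = torch.relu(pre * self.r_mag.exp() + self.b_mag)  # how strongly
        f = gate * mag                                         # sparse features
        return f @ self.W_dec + self.b_dec                     # reconstruction
```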
