Search Results for author: Rohin Shah

Found 24 papers, 8 papers with code

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP that lead to significant false negatives.
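
For intuition, here is a minimal, hypothetical sketch of the attribution-patching estimate on a toy two-layer network: the effect of patching an activation is approximated by a first-order term (patched minus original activation, times the gradient of the metric with respect to that activation). The module, shapes, and metric are illustrative stand-ins, not the paper's setup.

```python
# Toy sketch of Attribution Patching (AtP) vs. explicit activation patching.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

def run(x, patch=None):
    """Forward pass, optionally replacing the hidden activation with `patch`."""
    h = model[1](model[0](x))
    if patch is not None:
        h = patch
    return model[2](h)

# Cache the clean-prompt activation, and keep a gradient-tracked copy of the
# corrupt-prompt activation.
with torch.no_grad():
    h_clean = model[1](model[0](x_clean))
h_corrupt = model[1](model[0](x_corrupt))
h_corrupt.retain_grad()

# Gradient of the metric w.r.t. the hidden activation, computed on the corrupt prompt.
run(x_corrupt, patch=h_corrupt).sum().backward()

# AtP: first-order estimate of the effect of patching in the clean activation.
atp_estimate = ((h_clean - h_corrupt.detach()) * h_corrupt.grad).sum()

# Ground truth via explicit activation patching (one extra forward pass per node).
with torch.no_grad():
    actual = (run(x_corrupt, patch=h_clean) - run(x_corrupt, patch=h_corrupt.detach())).sum()

# With a linear readout the approximation is exact; in a real network it can miss
# effects, which is the kind of false negative the paper studies.
print(f"AtP estimate {atp_estimate.item():+.4f} vs. actual patch effect {actual.item():+.4f}")
```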

Challenges with unsupervised LLM knowledge discovery

no code implementations • 15 Dec 2023 • Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent.

Language Modelling • Large Language Model
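
For context, the probes such methods train use unsupervised objectives over paired activations. The sketch below shows one such objective (a consistency-plus-confidence loss in the style of contrast-consistent search), with random placeholder tensors standing in for real LLM hidden states; it is an illustration of the family of methods being critiqued, not the paper's code.

```python
# Illustrative CCS-style probe on placeholder activations.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_pairs = 64, 256
# Hidden states for contrast pairs: a statement phrased as true (x+) and as false (x-).
acts_pos = torch.randn(n_pairs, d_model)
acts_neg = torch.randn(n_pairs, d_model)

probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    # Consistency: the probe should treat x+ and x- as negations of each other.
    l_consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate p(x+) = p(x-) = 0.5 solution.
    l_confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = l_consistency + l_confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The paper's point is that many prominent features of the activations, not only "knowledge", can satisfy an objective like this.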

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

1 code implementation • NeurIPS 2023 • Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah

With two years of BASALT competitions now complete, we offer the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment.

Benchmarking

Explaining grokking through circuit efficiency

no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.
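
As a concrete illustration of the setup in which grokking is typically observed (not the paper's experiments), the sketch below trains a small network on modular addition with weight decay and logs train versus test accuracy; in a grokking run, train accuracy saturates long before test accuracy jumps. Architecture and hyperparameters are illustrative.

```python
# Minimal modular-addition setup where grokking-style behaviour is typically studied.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97  # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 3], perm[len(pairs) // 3 :]

def one_hot(ab):
    """Concatenate one-hot encodings of the two operands."""
    return torch.cat([nn.functional.one_hot(ab[:, 0], P),
                      nn.functional.one_hot(ab[:, 1], P)], dim=-1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):  # grokking shows up late, so the run is deliberately long
    logits = model(one_hot(pairs[train_idx]))
    loss = nn.functional.cross_entropy(logits, labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (logits.argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (model(one_hot(pairs[test_idx])).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step:6d}  train {train_acc:.2f}  test {test_acc:.2f}")
```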

SIRL: Similarity-based Implicit Representation Learning

no code implementations • 2 Jan 2023 • Andreea Bobu, Yi Liu, Rohin Shah, Daniel S. Brown, Anca D. Dragan

This, in turn, is what enables the robot to distinguish what needs to go into the representation from what is spurious, and which aspects of behavior can be compressed together and which cannot.

Contrastive Learning • Data Augmentation • +1
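
As a rough illustration of similarity-based representation learning, the sketch below uses a triplet-style contrastive objective over human similarity judgments ("B is more like A than C is"); the encoder, objective, and random stand-ins for trajectory features are assumptions for illustration, not the paper's method or data.

```python
# Generic triplet-style contrastive update from a human similarity judgment.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
traj_dim, feat_dim = 32, 8
encoder = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def triplet_loss(anchor, similar, dissimilar, margin=1.0):
    """Human said `similar` is closer to `anchor` than `dissimilar` is."""
    za, zs, zd = encoder(anchor), encoder(similar), encoder(dissimilar)
    return F.relu(F.pairwise_distance(za, zs) - F.pairwise_distance(za, zd) + margin).mean()

# One illustrative update on random placeholders for trajectory features.
a, s, d = (torch.randn(16, traj_dim) for _ in range(3))
loss = triplet_loss(a, s, d)
opt.zero_grad()
loss.backward()
opt.step()
```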

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

no code implementations • 4 Oct 2022 • Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization.

An Empirical Investigation of Representation Learning for Imitation

2 code implementations • 16 May 2022 • Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.

Image Classification • Imitation Learning • +1

The MineRL BASALT Competition on Learning from Human Feedback

no code implementations • 5 Jul 2021 • Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve.

Imitation Learning
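
One common instantiation of such a human-feedback learning signal, shown purely as an illustration and not as the competition's prescribed method, is a reward model fit to pairwise preferences with a Bradley-Terry objective. All tensors below are random placeholders for trajectory features and human labels.

```python
# Fit a reward model to pairwise human preferences (Bradley-Terry objective).
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim = 32
reward_model = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each comparison: features of two trajectory segments and a label
# (1.0 if the human preferred segment A, 0.0 if they preferred segment B).
seg_a, seg_b = torch.randn(128, feat_dim), torch.randn(128, feat_dim)
prefers_a = torch.randint(0, 2, (128,)).float()

for _ in range(200):
    r_a = reward_model(seg_a).squeeze(-1)
    r_b = reward_model(seg_b).squeeze(-1)
    # P(human prefers A) = sigmoid(r_a - r_b); maximize its likelihood.
    loss = nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefers_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```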

Learning What To Do by Simulating the Past

1 code implementation ICLR 2021 David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan

Since reward functions are hard to specify, recent work has focused on learning policies from human feedback.

Combining Reward Information from Multiple Sources

no code implementations • 22 Mar 2021 • Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof

We study this problem in the setting with two conflicting reward functions learned from different sources.

Informativeness

Choice Set Misspecification in Reward Inference

no code implementations • 19 Jan 2021 • Rachel Freedman, Rohin Shah, Anca Dragan

A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections.
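
To make the setting concrete, here is a hedged sketch of choice-set-based reward inference: the robot assumes the human chooses Boltzmann-rationally from a choice set and updates a posterior over a small discrete set of reward hypotheses. The final two calls show how an assumed (misspecified) choice set changes the posterior. All numbers are illustrative, not the paper's model.

```python
# Posterior over reward hypotheses given a human choice from a (possibly wrong) choice set.
import numpy as np

rng = np.random.default_rng(0)
n_options, n_hypotheses = 5, 4
# reward_table[h, c]: reward of option c under hypothesis h.
reward_table = rng.normal(size=(n_hypotheses, n_options))
prior = np.full(n_hypotheses, 1.0 / n_hypotheses)

def posterior(chosen, choice_set, reward_table, prior, beta=2.0):
    """P(hypothesis | human chose `chosen` out of `choice_set`), Boltzmann-rational human."""
    r = reward_table[:, choice_set]                 # (H, |C|)
    probs = np.exp(beta * r)
    probs /= probs.sum(axis=1, keepdims=True)       # choice model over the choice set
    lik = probs[:, choice_set.index(chosen)]
    post = lik * prior
    return post / post.sum()

true_choice_set = [0, 1, 2, 3, 4]
assumed_choice_set = [0, 1, 2]   # misspecified: the robot omits options 3 and 4
chosen = 1
print("correct choice set :", posterior(chosen, true_choice_set, reward_table, prior))
print("assumed choice set :", posterior(chosen, assumed_choice_set, reward_table, prior))
```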

Evaluating the Robustness of Collaborative Agents

no code implementations • 14 Jan 2021 • Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan, Rohin Shah

We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness.

Benefits of Assistance over Reward Learning

no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell

By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.

The MAGICAL Benchmark for Robust Imitation

1 code implementation • NeurIPS 2020 • Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings.

Imitation Learning

Optimal Policies Tend to Seek Power

1 code implementation • NeurIPS 2021 • Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.

Reinforcement Learning (RL)
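
A toy numerical illustration of the paper's flavor (not its formal theorem): draw reward functions at random over terminal states in a tiny branching environment and check how often the optimal first action heads toward the branch with more reachable terminal states, i.e. the branch that keeps more options open.

```python
# Fraction of random reward functions whose optimal policy takes the larger branch.
import numpy as np

rng = np.random.default_rng(0)
n_trials, left_terminals, right_terminals = 100_000, 2, 5

# i.i.d. uniform reward on each terminal state; the optimal first action goes to
# whichever branch contains the best terminal.
left_best = rng.uniform(size=(n_trials, left_terminals)).max(axis=1)
right_best = rng.uniform(size=(n_trials, right_terminals)).max(axis=1)
frac_larger_branch = (right_best > left_best).mean()

# Analytically this is right_terminals / (left_terminals + right_terminals) = 5/7.
print(f"optimal policy takes the larger branch in {frac_larger_branch:.3f} of draws")
```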

On the Utility of Learning about Humans for Human-AI Coordination

2 code implementations • NeurIPS 2019 • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan

While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves.

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

no code implementations • 23 Jun 2019 • Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan

But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach.

Preferences Implicit in the State of the World

1 code implementation • ICLR 2019 • Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan

We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized.

Reinforcement Learning (RL)

Active Inverse Reward Design

1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

We propose structuring this process as a series of queries asking the user to compare between different reward functions.

Informativeness
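
One natural way to make such comparison queries "active", sketched below with illustrative placeholder quantities rather than the paper's exact objective, is to pick the pair of candidate reward functions whose answer is expected to shrink posterior entropy over the true reward the most.

```python
# Pick the pairwise reward comparison with the lowest expected posterior entropy.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_hypotheses, n_candidates = 6, 4
# pref_prob[h, i, j]: P(user prefers candidate reward i over j | true reward is hypothesis h).
pref_prob = rng.uniform(0.05, 0.95, size=(n_hypotheses, n_candidates, n_candidates))
prior = np.full(n_hypotheses, 1.0 / n_hypotheses)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_posterior_entropy(i, j):
    """Expected entropy after asking "do you prefer reward i or reward j?"."""
    total = 0.0
    for answer_prob in (pref_prob[:, i, j], 1.0 - pref_prob[:, i, j]):
        evidence = (answer_prob * prior).sum()        # P(this answer)
        posterior = answer_prob * prior / evidence    # Bayes update on this answer
        total += evidence * entropy(posterior)
    return total

best_query = min(itertools.combinations(range(n_candidates), 2),
                 key=lambda q: expected_posterior_entropy(*q))
print("most informative comparison to ask:", best_query)
```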
