Search Results for author: Monte MacDiarmid

Found 5 papers, 4 papers with code

Alignment faking in large language models

1 code implementation • 18 Dec 2024 • Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger

Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

Large Language Model

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

1 code implementation • 14 Jun 2024 • Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments.

Language Modelling, Large Language Model

Understanding and Controlling a Maze-Solving Policy Network

no code implementations • 12 Oct 2023 • Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares.

Steering Language Models With Activation Engineering

2 code implementations • 20 Aug 2023 • Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.

Language Modeling, Language Modelling, +1
