Search Results for author: Alexander Matt Turner

Found 9 papers, 5 papers with code

Steering Llama 2 via Contrastive Activation Addition

3 code implementations · 9 Dec 2023 · Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes.

Tasks: Multiple-choice
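The core arithmetic behind contrastive activation steering can be illustrated in a few lines: average the activation differences over pairs of contrastive prompts, then add the resulting vector to the model's hidden state during a forward pass. This is a minimal sketch with toy NumPy arrays standing in for residual-stream activations; the array shapes, the `alpha` scale, and the data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def contrastive_steering_vector(pos_acts, neg_acts):
    """Mean activation difference over contrastive prompt pairs.

    pos_acts, neg_acts: (n_pairs, hidden_dim) arrays of activations at one
    layer. Hypothetical toy data stands in for real model activations.
    """
    return (pos_acts - neg_acts).mean(axis=0)

def steer(activation, vector, alpha=1.0):
    """Add the scaled steering vector to an activation mid-forward-pass."""
    return activation + alpha * vector

# Toy example: 4 contrastive pairs in a 3-dimensional "hidden state",
# constructed so the pairs differ along a single fixed direction.
rng = np.random.default_rng(0)
neg = rng.normal(size=(4, 3))
pos = neg + np.array([1.0, 0.0, -0.5])

v = contrastive_steering_vector(pos, neg)      # recovers the direction
steered = steer(np.zeros(3), v, alpha=2.0)     # scaled copy of that direction
```

In practice the addition would happen inside the model (e.g. via a forward hook on a chosen layer) rather than on a bare array, but the vector construction is the same.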

Understanding and Controlling a Maze-Solving Policy Network

no code implementations · 12 Oct 2023 · Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares.

Activation Addition: Steering Language Models Without Optimization

1 code implementation · 20 Aug 2023 · Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid

We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet, and replicate the effect on Llama-13B and GPT-J-6B.

Tasks: Prompt Engineering

Parametrically Retargetable Decision-Makers Tend To Seek Power

no code implementations · 27 Jun 2022 · Alexander Matt Turner, Prasad Tadepalli

We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power.

Tasks: Decision Making, Montezuma's Revenge

On Avoiding Power-Seeking by Artificial Intelligence

no code implementations · 23 Jun 2022 · Alexander Matt Turner

I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power.

Tasks: Decision Making

Avoiding Side Effects in Complex Environments

2 code implementations · NeurIPS 2020 · Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli

By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead while leading the agent to complete the specified task and avoid many side effects.
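The AUP idea described above — penalizing actions that change how well the agent could optimize an auxiliary reward — can be sketched as a reward-shaping term. This is a simplified illustration under stated assumptions: the `q_aux` table, the no-op action, and the penalty coefficient `lam` are hypothetical stand-ins, not the paper's exact formulation.

```python
def aup_reward(task_reward, q_aux, state, action, noop, lam=0.1):
    """Task reward minus a penalty for shifting attainable utility.

    q_aux: hypothetical table mapping (state, action) -> Q-value under a
    single auxiliary (e.g. randomly generated) reward function. The penalty
    compares acting against doing nothing, so actions that preserve the
    agent's ability to pursue the auxiliary reward go unpenalized.
    """
    penalty = abs(q_aux[(state, action)] - q_aux[(state, noop)])
    return task_reward - lam * penalty

# Toy example: pushing a box reduces attainable auxiliary value (0.9 -> 0.2),
# so the shaped reward is docked relative to the raw task reward of 1.0.
q_aux = {("s", "push"): 0.2, ("s", "wait"): 0.9}
shaped = aup_reward(1.0, q_aux, "s", "push", noop="wait", lam=0.5)
```

The key design choice is that the penalty measures change relative to inaction, so routine task progress incurs only modest overhead while irreversible side effects are discouraged.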

Optimal Policies Tend to Seek Power

1 code implementation · NeurIPS 2021 · Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.

Tasks: Reinforcement Learning (RL)

Conservative Agency via Attainable Utility Preservation

3 code implementations · 26 Feb 2019 · Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.
