3 code implementations • 9 Dec 2023 • Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
We introduce Contrastive Activation Addition (CAA), a method for steering language models by modifying their activations during forward passes.
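A minimal sketch of the contrastive steering idea, assuming GPT-2 via Hugging Face `transformers`; the layer index, scale, and toy prompt pair are illustrative assumptions, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER, SCALE = 6, 4.0  # illustrative choices, not from the paper
pairs = [("I love this plan.", "I hate this plan.")]  # toy (positive, negative) pair

def last_token_act(text):
    """Residual-stream activation after block LAYER at the final token."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]  # hidden_states[i+1] is block i's output

# Steering vector: mean activation difference across contrastive pairs.
steer = torch.stack([last_token_act(p) - last_token_act(n)
                     for p, n in pairs]).mean(dim=0)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # broadcasting adds the vector at every token position.
    return (output[0] + SCALE * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("My opinion of the plan is", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```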
no code implementations • 12 Oct 2023 • Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner
To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares.
1 code implementation • 20 Aug 2023 • Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid
We demonstrate ActAdd on GPT-2, evaluating on OpenWebText and ConceptNet, and replicate the effect on LLaMA-13B and GPT-J-6B.
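A hedged, self-contained condensation of the ActAdd operation, which steers with a single prompt pair rather than a dataset of contrastive pairs; the layer, coefficient, and prompts below are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, COEFF = 6, 10.0  # illustrative, not the paper's exact values

def acts(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0]  # (seq_len, d_model) after block LAYER

a, b = acts(" Love"), acts(" Hate")  # one contrast pair, not a dataset
n = min(len(a), len(b))
steer = COEFF * (a[:n] - b[:n])      # per-position steering sequence

def inject(module, inputs, output):
    h = output[0]
    if h.shape[1] >= n:              # full-prompt pass only, skip cached
        h = h.clone()                # single-token generation steps
        h[:, :n] += steer            # add at the first n token positions
        return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("I went to the store and", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```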
no code implementations • 27 Jun 2022 • Alexander Matt Turner, Prasad Tadepalli
We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power.
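A toy numerical illustration of this retargetability claim, under assumptions of my own (a choice leading to three outcomes versus a choice leading to one, i.i.d. uniform utilities, and two simple decision rules), not the paper's formal model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
u = rng.uniform(size=(N, 4))  # i.i.d. utilities; outcomes 0-2 via choice A, 3 via B

# Rule 1: an expected-utility maximizer picks the best reachable outcome.
greedy_picks_A = (u[:, :3].max(axis=1) > u[:, 3]).mean()

# Rule 2: a Boltzmann-rational chooser; average probability mass on A's outcomes.
w = np.exp(5.0 * u)
boltzmann_mass_A = (w[:, :3].sum(axis=1) / w.sum(axis=1)).mean()

print(f"greedy picks A on {greedy_picks_A:.1%} of draws")      # about 75%
print(f"Boltzmann mass on A averages {boltzmann_mass_A:.1%}")  # about 75%
```

Both rules concentrate roughly three quarters of their choices on the option-rich alternative, even though the rules themselves are qualitatively different.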
no code implementations • 23 Jun 2022 • Alexander Matt Turner
I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents that have limited impact on the world and do not autonomously seek power.
no code implementations • 23 Jun 2022 • Alexander Matt Turner, Aseem Saxena, Prasad Tadepalli
AI objectives are often hard to specify properly.
2 code implementations • NeurIPS 2020 • Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli
By preserving optimal value for a single randomly generated reward function, attainable utility preservation (AUP) incurs modest overhead while leading the agent to complete the specified task and avoid many side effects.
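A minimal sketch of the AUP penalty term under this single-auxiliary-reward formulation; the function names, interface, and λ value are illustrative assumptions, and real implementations may additionally scale or normalize the penalty:

```python
def aup_reward(r_task, q_aux, state, action, noop, lam=0.01):
    """Task reward minus a penalty for changing attainable auxiliary value,
    measured against the do-nothing baseline action."""
    penalty = abs(q_aux(state, action) - q_aux(state, noop))
    return r_task(state, action) - lam * penalty
```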
1 code implementation • NeurIPS 2021 • Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.
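A hedged toy rendering of the paper's central quantity -- average optimal value across sampled reward functions -- on a four-state MDP of my own construction; the discount factor, sampling scheme, and transition structure are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, N_STATES = 0.9, 4
# Deterministic toy MDP: state 0 may stay put or move to 1; state 1 may
# move to 2 or 3; states 2 and 3 are absorbing (few remaining options).
SUCC = {0: [0, 1], 1: [2, 3], 2: [2], 3: [3]}

def optimal_values(r, iters=200):
    """Value iteration: V(s) = max over successors s' of r(s') + GAMMA*V(s')."""
    v = np.zeros(N_STATES)
    for _ in range(iters):
        v = np.array([max(r[s2] + GAMMA * v[s2] for s2 in SUCC[s])
                      for s in range(N_STATES)])
    return v

samples = 2000
avg_v = sum(optimal_values(rng.uniform(size=N_STATES))
            for _ in range(samples)) / samples
print(avg_v)  # option-rich states (0, 1) score above the absorbing ones
```

States that keep more futures reachable accumulate higher average optimal value, which is the sense in which optimal policies tend to steer toward them.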
3 code implementations • 26 Feb 2019 • Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.