no code implementations • 14 Apr 2025 • Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm.
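The basic loop of such an evaluation fits in a few lines. The sketch below is a hypothetical rendering, not the paper's code (the names `Protocol`, `honest`, and `attack` are invented here): usefulness is estimated against an honest policy, and safety against a red-team attack policy standing in for an intentionally subversive model.

```python
# Illustrative sketch of a control evaluation (hypothetical names throughout):
# run the same protocol with an honest policy to estimate usefulness, and with
# a red-team attack policy to estimate safety.
from typing import Callable, List, Tuple

Policy = Callable[[str], str]
# A protocol runs a policy on a task and reports (task_solved, harm_occurred).
Protocol = Callable[[Policy, str], Tuple[bool, bool]]

def control_evaluation(protocol: Protocol, honest: Policy, attack: Policy,
                       tasks: List[str]) -> dict:
    useful = sum(protocol(honest, t)[0] for t in tasks) / len(tasks)
    unsafe = sum(protocol(attack, t)[1] for t in tasks) / len(tasks)
    return {"usefulness": useful, "safety": 1 - unsafe}
```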
no code implementations • 7 Apr 2025 • Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
As LLM agents grow more capable of autonomously causing harm, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from doing so.
no code implementations • 28 Jan 2025 • Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that these agents are safe to deploy.
1 code implementation • 18 Dec 2024 • Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
no code implementations • 17 Dec 2024 • Alex Mallen, Charlie Griffin, Alessandro Abate, Buck Shlegeris
However, to subvert a protocol, an AI system must be able to reliably generate optimal plans in each context; coordinate plans with other instances of itself without communicating; and take actions with well-calibrated probabilities.
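The coordination-without-communication requirement can be made concrete with a toy construction (mine, not the paper's): if every instance derives its decision deterministically from shared context, independent copies agree on when to act while the overall action probability stays calibrated at $p$.

```python
import hashlib

def should_act(context: str, p: float) -> bool:
    """Act on roughly a p-fraction of contexts, deterministically.

    Every instance that sees the same context makes the same choice, so
    independent copies coordinate with zero communication, and the action
    frequency across contexts stays calibrated at p.
    """
    digest = hashlib.sha256(context.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform in [0, 1)
    return u < p
```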
no code implementations • 26 Nov 2024 • Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them.
no code implementations • 29 Oct 2024 • Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming.
no code implementations • 28 Oct 2024 • Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, David Duvenaud
We develop a set of related threat models and evaluations.
no code implementations • 12 Sep 2024 • Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary.
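A toy rendering of that exercise as a game (all payoffs below are invented for illustration, not taken from the paper): the protocol designer picks an audit budget, the adversary best-responds with an attack rate, and the designer is scored on worst-case safety.

```python
# Toy payoff model (invented numbers): each attack is caught with probability
# equal to the audit budget, and "safety" is the chance that no attack
# succeeds across 100 episodes. The designer maximizes against the
# adversary's best response.

def toy_safety(audit_budget: float, attack_rate: float) -> float:
    p_attack_succeeds = attack_rate * (1 - audit_budget)
    return (1 - p_attack_succeeds) ** 100

budgets = [0.02, 0.1, 0.5]
attack_rates = [0.01, 0.05, 0.2]
best_budget = max(budgets,
                  key=lambda b: min(toy_safety(b, r) for r in attack_rates))
```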
1 code implementation • 14 Jun 2024 • Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments.
1 code implementation • 10 Jan 2024 • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
1 code implementation • 12 Dec 2023 • Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
One such protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the two GPT-4 instances from colluding.
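A minimal sketch of that shape of protocol, with hypothetical names throughout (`generate`, `monitor`, and `audit` are placeholders invented here, not the paper's code): the untrusted model writes solutions, a second untrusted instance assigns suspicion scores, and a small trusted audit budget is spent on the most suspicious submissions.

```python
from typing import Callable, List, Tuple

def untrusted_monitoring(
    problems: List[str],
    generate: Callable[[str], str],        # untrusted model writes a solution
    monitor: Callable[[str, str], float],  # second untrusted instance: suspicion in [0, 1]
    audit: Callable[[str, str], bool],     # trusted audit: True if a backdoor is found
    audit_fraction: float = 0.02,          # e.g. audit the top 2% most suspicious
) -> Tuple[List[str], bool]:
    scored: List[Tuple[float, str, str]] = []
    for problem in problems:
        solution = generate(problem)
        # Anti-collusion measures (e.g. paraphrasing or redacting the solution
        # before the monitor sees it) would go here; they are elided in this sketch.
        scored.append((monitor(problem, solution), problem, solution))
    scored.sort(key=lambda t: t[0], reverse=True)  # most suspicious first
    n_audits = max(1, int(audit_fraction * len(scored)))
    backdoor_caught = any(audit(p, s) for _, p, s in scored[:n_audits])
    return [s for _, _, s in scored], backdoor_caught
```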
no code implementations • 10 Oct 2023 • Chris MacLeod, Evgenia Nitishinskaya, Buck Shlegeris
We review the cumulant decomposition, a way of decomposing the expectation of a product of random variables (e.g. $\mathbb{E}[XYZ]$) into a sum of terms corresponding to partitions of these variables.
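Concretely (a standard identity, stated here for reference): the expectation of a product is a sum over all set partitions of the variables, with one joint cumulant $\kappa$ per block. For three variables, the five partitions give

$$\mathbb{E}[XYZ] = \kappa(X,Y,Z) + \kappa(X)\,\kappa(Y,Z) + \kappa(Y)\,\kappa(X,Z) + \kappa(Z)\,\kappa(X,Y) + \kappa(X)\,\kappa(Y)\,\kappa(Z),$$

where $\kappa(X) = \mathbb{E}[X]$ and $\kappa(X,Y) = \mathrm{Cov}(X,Y)$; in general $\mathbb{E}\big[\prod_{i \in S} X_i\big] = \sum_{\pi \in \mathcal{P}(S)} \prod_{B \in \pi} \kappa(X_i : i \in B)$, with $\mathcal{P}(S)$ the set of partitions of $S$.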
1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.
1 code implementation • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.
6 code implementations • 1 Nov 2022 • Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.
no code implementations • 4 Oct 2022 • Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris
We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.
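A stylized version of the underlying allocation problem (a paraphrase of the setup, not the paper's exact formulation): assign each feature $i$ a capacity $C_i$ under a total budget set by the embedding dimension $d$,

$$\min_{C_1,\dots,C_n} L(C_1,\dots,C_n) \quad \text{subject to} \quad 0 \le C_i \le 1, \;\; \sum_i C_i \le d.$$

At an optimum the features split into three regimes: $C_i = 1$ (represented monosemantically), $0 < C_i < 1$ with equal marginal loss reductions $-\partial L / \partial C_i$ across features (represented polysemantically), and $C_i = 0$ (ignored), matching the ordering by importance described above.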
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.
3 code implementations • 19 Oct 2018 • Paul Christiano, Buck Shlegeris, Dario Amodei
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior.