Search Results for author: Buck Shlegeris

Found 19 papers, 8 papers with code

Ctrl-Z: Controlling AI Agents via Resampling

no code implementations14 Apr 2025 Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm.

AI Agent Blocking

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

no code implementations7 Apr 2025 Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving

As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm.

A sketch of an AI control safety case

no code implementations28 Jan 2025 Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving

As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe.

Alignment faking in large language models

1 code implementation18 Dec 2024 Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger

Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

Large Language Model

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols

no code implementations17 Dec 2024 Alex Mallen, Charlie Griffin, Alessandro Abate, Buck Shlegeris

However, to subvert a protocol, an AI system must be able to reliably generate optimal plans in each context; coordinate plans with other instances of itself without communicating; and take actions with well-calibrated probabilities.

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

no code implementations26 Nov 2024 Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan

As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them.

Code Generation

Towards evaluations-based safety cases for AI scheming

no code implementations29 Oct 2024 Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq

We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming.

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

no code implementations12 Sep 2024 Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary.

Decision Making Red Teaming

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

1 code implementation14 Jun 2024 Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments.

Language Modelling Large Language Model

AI Control: Improving Safety Despite Intentional Subversion

1 code implementation12 Dec 2023 Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding.

Red Teaming

Generalized Wick Decompositions

no code implementations10 Oct 2023 Chris MacLeod, Evgenia Nitishinskaya, Buck Shlegeris

We review the cumulant decomposition (a way of decomposing the expectation of a product of random variables (e. g. $\mathbb{E}[XYZ]$) into a sum of terms corresponding to partitions of these variables.)

Benchmarks for Detecting Measurement Tampering

1 code implementation29 Aug 2023 Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.

Language models are better than humans at next-token prediction

1 code implementation21 Dec 2022 Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.

Question Answering

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

6 code implementations1 Nov 2022 Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.

Language Modeling Language Modelling

Polysemanticity and Capacity in Neural Networks

no code implementations4 Oct 2022 Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.

Adversarial Training for High-Stakes Reliability

no code implementations3 May 2022 Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.

Text Generation Vocal Bursts Intensity Prediction

Supervising strong learners by amplifying weak experts

3 code implementations19 Oct 2018 Paul Christiano, Buck Shlegeris, Dario Amodei

Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior.

Cannot find the paper you are looking for? You can Submit a new open access paper.