1 code implementation • 10 Jan 2024 • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
no code implementations • 12 Dec 2023 • Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding.
no code implementations • 10 Oct 2023 • Chris MacLeod, Evgenia Nitishinskaya, Buck Shlegeris
We review the cumulant decomposition (a way of decomposing the expectation of a product of random variables (e. g. $\mathbb{E}[XYZ]$) into a sum of terms corresponding to partitions of these variables.)
1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.
no code implementations • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.
3 code implementations • 1 Nov 2022 • Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.
no code implementations • 4 Oct 2022 • Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris
We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.
3 code implementations • 19 Oct 2018 • Paul Christiano, Buck Shlegeris, Dario Amodei
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior.