Search Results for author: Buck Shlegeris

Found 9 papers, 4 papers with code

AI Control: Improving Safety Despite Intentional Subversion

no code implementations • 12 Dec 2023 • Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

One protocol asks GPT-4 to write code, then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the two instances from colluding.
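A minimal sketch of the trusted-monitoring idea behind such a protocol; the function names, threshold, and random stub monitor are illustrative stand-ins, not the paper's implementation:

```python
import random

SUSPICION_THRESHOLD = 0.8  # hypothetical threshold; the paper tunes this against a limited audit budget


def untrusted_write_code(task: str) -> str:
    # Stand-in for the untrusted model (GPT-4 in the paper) writing a solution.
    return f"# candidate solution for: {task}"


def untrusted_monitor(code: str) -> float:
    # Stand-in for a second model instance rating how likely the code is backdoored.
    return random.random()


def trusted_write_code(task: str) -> str:
    # Stand-in for a weaker but trusted model used as a fallback.
    return f"# simple trusted solution for: {task}"


def protocol(task: str) -> str:
    code = untrusted_write_code(task)
    if untrusted_monitor(code) > SUSPICION_THRESHOLD:
        # Suspicious output is replaced by the trusted model's solution
        # (the paper also considers auditing flagged code with humans).
        return trusted_write_code(task)
    return code


print(protocol("sort a list"))
```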

Generalized Wick Decompositions

no code implementations • 10 Oct 2023 • Chris MacLeod, Evgenia Nitishinskaya, Buck Shlegeris

We review the cumulant decomposition: a way of decomposing the expectation of a product of random variables (e.g. $\mathbb{E}[XYZ]$) into a sum of terms corresponding to partitions of these variables.
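Concretely, for three random variables the decomposition has one term per partition of $\{X, Y, Z\}$ (five in total), where $\kappa$ denotes the joint cumulant, with $\kappa(X) = \mathbb{E}[X]$ and $\kappa(X, Y) = \mathrm{Cov}(X, Y)$:

$$\mathbb{E}[XYZ] = \kappa(X,Y,Z) + \kappa(X)\,\kappa(Y,Z) + \kappa(Y)\,\kappa(X,Z) + \kappa(Z)\,\kappa(X,Y) + \kappa(X)\,\kappa(Y)\,\kappa(Z).$$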

Benchmarks for Detecting Measurement Tampering

1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.

Language models are better than humans at next-token prediction

no code implementations • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.

Question Answering
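A minimal sketch of the kind of measurement behind this comparison: top-1 next-token accuracy for a language model on a text, here with GPT-2 via Hugging Face transformers (the model and text are illustrative, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, sequence_length, vocab_size)

preds = logits[0, :-1].argmax(dim=-1)  # model's top guess for each next token
targets = ids[0, 1:]                   # the tokens that actually came next
accuracy = (preds == targets).float().mean().item()
print(f"top-1 next-token accuracy: {accuracy:.2f}")
```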

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

3 code implementations • 1 Nov 2022 • Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components.

Language Modelling
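A minimal sketch of the indirect object identification (IOI) behavior the paper dissects: given a typical IOI prompt, GPT-2 small should assign a higher logit to the indirect object (" Mary") than to the repeated subject (" John"); the logit difference is the metric the paper studies:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 small
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token logits at the final position

mary = tok(" Mary").input_ids[0]  # single-token names in GPT-2's vocabulary
john = tok(" John").input_ids[0]
print("logit difference (Mary - John):", (logits[mary] - logits[john]).item())
```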

Polysemanticity and Capacity in Neural Networks

no code implementations • 4 Oct 2022 • Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.
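A minimal sketch of a toy superposition model in the spirit of the paper's setup: more sparse features than hidden dimensions, with geometrically decaying importance weighting the reconstruction loss. The architecture, hyperparameters, and the Gram-matrix capacity score below are assumptions for illustration, not the paper's exact model:

```python
import torch

n_features, d_hidden, batch = 20, 5, 1024
importance = 0.8 ** torch.arange(n_features)  # geometrically decaying feature importance
p_active = 0.05                               # each feature is only rarely active (sparsity)

W = (0.1 * torch.randn(d_hidden, n_features)).requires_grad_()
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Sample sparse feature vectors with activations in [0, 1].
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) < p_active)
    recon = torch.relu(x @ W.T @ W + b)       # compress to d_hidden dims, then reconstruct
    loss = (importance * (recon - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Per-feature capacity score: near 1 for a feature with a dedicated direction
# (monosemantic), smaller when directions are shared (polysemantic),
# near 0 for features the model ignores.
G = W.T @ W
capacity = G.diag() ** 2 / ((G ** 2).sum(dim=1) + 1e-12)
print(capacity)
```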

Adversarial Training for High-Stakes Reliability

no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.

Text Generation
Vocal Bursts Intensity Prediction
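A minimal sketch of that adversarial training loop; the classifier, data, and attack step are hypothetical stand-ins (in the paper, the attackers were human contractors, optionally aided by a rewriting tool, searching for unsafe story completions the classifier misses):

```python
class Classifier:
    # Stand-in for the paper's fine-tuned transformer classifier.
    def __init__(self):
        self.data = []

    def fit(self, examples):
        self.data = list(examples)  # pretend to (re)train on all examples


def find_adversarial_examples(classifier):
    # Stand-in for attackers searching for snippets that the
    # current classifier wrongly lets through.
    return [("a snippet the classifier misclassifies", "unsafe")]


def adversarial_training(clean_data, rounds=3):
    clf = Classifier()
    data = list(clean_data)
    clf.fit(data)
    for _ in range(rounds):
        data += find_adversarial_examples(clf)  # attack the current model
        clf.fit(data)                           # retrain on its failures
    return clf


clf = adversarial_training([("an ordinary snippet", "safe")])
print(len(clf.data), "training examples after adversarial rounds")
```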

Supervising strong learners by amplifying weak experts

3 code implementations • 19 Oct 2018 • Paul Christiano, Buck Shlegeris, Dario Amodei

Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior.
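A minimal sketch of the amplification idea on a toy task: an expert that can only add two numbers is composed, by recursive decomposition, into a system that sums arbitrarily long lists (in the paper, the amplified system's answers then supervise the training of a stronger model). All names here are illustrative:

```python
def weak_expert(a: int, b: int) -> int:
    # The easy subtask the weak expert handles reliably.
    return a + b


def amplify(numbers: list[int]) -> int:
    # Decompose the hard task (summing a long list) into subtasks the
    # weak expert can solve, then combine the results recursively.
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    return weak_expert(amplify(numbers[:mid]), amplify(numbers[mid:]))


print(amplify(list(range(10))))  # 45, assembled from many two-number additions
```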
