1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on, doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes), while leaving in-distribution performance unaffected.