1 code implementation • 12 Sep 2023 • Maximilian Li, Xander Davies, Max Nadeau
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks.
1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.
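As a toy illustration of that failure mode (not drawn from the paper's benchmarks; all names and numbers below are hypothetical), the sketch optimizes a measured training signal that the system can tamper with, and the optimizer inflates the measurement rather than the true objective:

```python
import numpy as np

# Toy setup: the true objective depends on a latent quantity, but the
# training signal only sees a measurement the policy can tamper with.
def true_value(action):
    return -(action - 1.0) ** 2          # maximized at action = 1.0

def measured_value(action, tamper):
    # Tampering inflates the measurement without improving the true value.
    return true_value(action) + 5.0 * tamper

# Naive hill-climbing on the measurement picks maximal tampering.
candidates = [(a, t) for a in np.linspace(-2, 2, 41) for t in (0.0, 1.0)]
best = max(candidates, key=lambda p: measured_value(*p))
print(f"chosen action={best[0]:.2f}, tamper={best[1]}")
print(f"measured reward={measured_value(*best):.2f}, "
      f"true reward={true_value(best[0]):.2f}")
```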
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
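For concreteness, here is a minimal sketch of the reward-modeling stage RLHF typically uses, assuming PyTorch; the Bradley-Terry pairwise loss is standard, but the linear reward head and random "embeddings" are placeholders for features from a real language model:

```python
import torch
import torch.nn as nn

# A scalar reward head is trained so that human-preferred responses score
# higher than rejected ones (Bradley-Terry pairwise loss). The 4-dim inputs
# stand in for language-model features; shapes and names are illustrative.
reward_model = nn.Linear(4, 1)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

chosen = torch.randn(8, 4)    # embeddings of preferred responses (placeholder)
rejected = torch.randn(8, 4)  # embeddings of rejected responses (placeholder)

for _ in range(100):
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned reward then serves as the training signal for a subsequent policy-optimization stage, commonly PPO.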
no code implementations • 7 Jul 2023 • Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau
Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits.
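As a minimal sketch of what intervening on a single-unit feature can look like in practice (assuming PyTorch; the tiny MLP stands in for a language-model component), a forward hook overwrites one hidden unit's activation and we compare outputs with and without the intervention:

```python
import torch
import torch.nn as nn

# A forward hook clamps one hidden unit to zero, so we can observe how the
# model's output depends on that unit.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

UNIT = 3  # index of the hidden unit to intervene on (arbitrary choice)

def clamp_unit(module, inputs, output):
    patched = output.clone()
    patched[:, UNIT] = 0.0  # ablate: clamp the unit's activation to zero
    return patched

x = torch.randn(1, 8)
baseline = model(x)
handle = model[0].register_forward_hook(clamp_unit)
intervened = model(x)
handle.remove()
print(baseline - intervened)  # effect of the intervention on the output
```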
2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
We demonstrate that feature-level adversarial perturbations can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
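A simplified sketch of the basic targeted-attack loop this builds on, assuming PyTorch: projected gradient steps on a pixel-space perturbation push a tiny, randomly initialized CNN toward an attacker-chosen class. The paper's attacks differ in scale and parameterization (feature-level perturbations at ImageNet scale), and a universal variant would optimize one perturbation across a batch of images; everything below is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in classifier; a real attack would target a pretrained ImageNet model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
model.eval()

image = torch.rand(1, 3, 32, 32)
target_class = torch.tensor([7])       # attacker-chosen label
delta = torch.zeros_like(image, requires_grad=True)
epsilon, step = 8 / 255, 1 / 255       # common L-infinity budget

for _ in range(40):
    loss = F.cross_entropy(model(image + delta), target_class)
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()  # descend toward the target class
        delta.clamp_(-epsilon, epsilon)    # project back into the budget
        delta.grad.zero_()

print(model(image + delta).argmax(dim=1))  # ideally prints the target class
```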