1 code implementation • 19 Jan 2025 • Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors; models do this without any special training or examples.
no code implementations • 14 Jan 2025 • James Chua, Owain Evans
We refer to these models as Inference-Time-Compute (ITC) models.
1 code implementation • 17 Oct 2024 • Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
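The setup described above can be sketched as follows. This is a minimal, hypothetical illustration of how one self-prediction finetuning example might be constructed, pairing a hypothetical-scenario question with the model's own ground-truth behavior; the function name and data format are assumptions for illustration, not taken from the paper.

```python
def make_example(scenario: str, question: str, actual_behavior: str) -> dict:
    """Build one finetuning example: the prompt poses a hypothetical
    scenario and asks the model to predict a property of its own answer;
    the completion is the model's actual behavior on that scenario.
    (Illustrative sketch; format and names are assumptions.)"""
    prompt = f"Suppose you were asked: {scenario}\n{question}"
    return {"prompt": prompt, "completion": actual_behavior}

# Hypothetical example: the ground-truth completion would come from
# actually sampling the model on the scenario beforehand.
example = make_example(
    scenario="'Name any country.'",
    question="Would the second character of your answer be a vowel? Answer yes or no.",
    actual_behavior="yes",
)
```

Finetuning on many such pairs trains the model to predict properties of its own behavior rather than memorize specific answers.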
no code implementations • 21 Jul 2024 • Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
1 code implementation • 8 Mar 2024 • James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin
Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%.