Search Results for author: Max Nadeau

Found 5 papers, 3 papers with code

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

1 code implementation • 12 Sep 2023 • Maximilian Li, Xander Davies, Max Nadeau

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks.

Text Generation

Benchmarks for Detecting Measurement Tampering

1 code implementation • 29 Aug 2023 • Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization.

Discovering Variable Binding Circuitry with Desiderata

no code implementations • 7 Jul 2023 • Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits.

Robust Feature-Level Adversaries are Interpretability Tools

2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

We demonstrate that feature-level adversaries can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
