no code implementations • 25 Jan 2024 • Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
The effectiveness of an audit, however, depends on the degree of system access granted to auditors.
1 code implementation • 15 Sep 2023 • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.
no code implementations • 5 May 2023 • Lee Sharkey
The ability of neural networks to represent more features than neurons makes interpreting them challenging.
no code implementations • 21 Dec 2022 • Lee Sharkey
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values.
no code implementations • 22 Nov 2022 • Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy
Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned.
4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger
We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).