Search Results for author: Lee Sharkey

Found 6 papers, 2 papers with code

Sparse Autoencoders Find Highly Interpretable Features in Language Models

1 code implementation • 15 Sep 2023 • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.

Tasks: counterfactual, Language Modelling, +1
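To make the superposition picture above concrete, here is a minimal Python sketch (not code from the paper; all sizes, names, and the untrained autoencoder weights are illustrative assumptions): a toy layer assigns more sparse features than it has neurons to an overcomplete set of directions, and a sparse autoencoder of the kind the paper's title refers to would then be trained on the resulting activations to recover those directions.

```python
# Minimal sketch of superposition: more features than neurons,
# each feature assigned a direction in activation space.
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 16      # dimensionality of the activation space
n_features = 64     # more features than neurons (overcomplete)

# Each feature gets a unit-norm direction in activation space.
feature_directions = rng.normal(size=(n_features, n_neurons))
feature_directions /= np.linalg.norm(feature_directions, axis=1, keepdims=True)

# Sparse feature activations: most features are off on any given input.
feature_acts = rng.random(size=(8, n_features)) * (rng.random(size=(8, n_features)) < 0.05)

# Observed neuron activations are sums of active feature directions,
# so individual neurons respond to many unrelated features (polysemanticity).
neuron_acts = feature_acts @ feature_directions

# A sparse autoencoder would be trained on neuron_acts with a sparsity
# penalty to recover interpretable directions; only the (untrained)
# forward pass is shown here.
W_enc = 0.1 * rng.normal(size=(n_neurons, n_features))
b_enc = np.zeros(n_features)
W_dec = 0.1 * rng.normal(size=(n_features, n_neurons))

codes = np.maximum(neuron_acts @ W_enc + b_enc, 0.0)   # ReLU encoder
recon = codes @ W_dec                                  # linear decoder
print(neuron_acts.shape, codes.shape, recon.shape)     # (8, 16) (8, 64) (8, 16)
```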

A technical note on bilinear layers for interpretability

no code implementations • 5 May 2023 • Lee Sharkey

The ability of neural networks to represent more features than neurons makes interpreting them challenging.
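As a rough illustration of the kind of layer the note is about, the sketch below assumes a bilinear layer of the form y = (W1 x) * (W2 x), one common definition; this is an assumption for illustration, not code or notation taken from the note. Because the output is quadratic in the input with no other nonlinearity, each output unit can be written as an explicit quadratic form, which is what makes such layers amenable to linear-algebraic analysis.

```python
# Sketch of a bilinear layer: elementwise product of two linear maps.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

W1 = rng.normal(size=(d_out, d_in))
W2 = rng.normal(size=(d_out, d_in))

def bilinear_layer(x: np.ndarray) -> np.ndarray:
    """Elementwise product of two linear transforms of the same input."""
    return (W1 @ x) * (W2 @ x)

x = rng.normal(size=d_in)
y = bilinear_layer(x)

# Each output unit i equals x^T B_i x with B_i = outer(W1[i], W2[i]),
# i.e. the layer is a fixed set of quadratic forms in the input.
B0 = np.outer(W1[0], W2[0])
assert np.allclose(y[0], x @ B0 @ x)
```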

Circumventing interpretability: How to defeat mind-readers

no code implementations • 21 Dec 2022 • Lee Sharkey

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values.

Interpreting Neural Networks through the Polytope Lens

no code implementations • 22 Nov 2022 • Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned.
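The polytope lens looks at a different unit of analysis than individual neurons or their linear combinations: a ReLU network partitions input space into regions (polytopes) identified by which neurons are active, and the network is affine within each region. The sketch below illustrates that idea with a toy two-layer network; the shapes and names are illustrative assumptions, not the authors' code.

```python
# Sketch of the polytope view: inputs with the same ReLU activation
# pattern lie in the same polytope, where the network is affine.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 2)), rng.normal(size=32)
W2, b2 = rng.normal(size=(1, 32)), rng.normal(size=1)

def forward(x):
    pre = W1 @ x + b1
    h = np.maximum(pre, 0.0)           # ReLU hidden layer
    return W2 @ h + b2, (pre > 0)      # output and activation pattern

x_a = np.array([0.5, -0.2])
x_b = x_a + 1e-4                       # a nearby input

(_, code_a), (_, code_b) = forward(x_a), forward(x_b)
print("same polytope:", np.array_equal(code_a, code_b))
```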

Goal Misgeneralization in Deep Reinforcement Learning

4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger

We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).

Tasks: Navigate, Out-of-Distribution Generalization, +2
