Search Results for author: Aidan Ewart

Found 2 papers, 1 paper with code

Eight Methods to Evaluate Robust Unlearning in LLMs

no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.

Machine Unlearning

Sparse Autoencoders Find Highly Interpretable Features in Language Models

1 code implementation • 15 Sep 2023 • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

One hypothesised cause of polysemanticity is *superposition*, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.
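The superposition idea above can be sketched numerically: assign each of several features a direction in a lower-dimensional activation space, superpose the directions of the active features, and read features back out by dot product. This is a minimal numpy illustration of the concept, not code from the paper; the dimensions and the single-feature input are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# More features than neurons: the dictionary of directions is overcomplete.
n_neurons, n_features = 4, 8

# Each feature gets a random unit direction in activation space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse feature vector: only feature 3 is active.
features = np.zeros(n_features)
features[3] = 1.0

# The network's activation is the superposed sum of active feature directions.
activation = features @ directions

# Dot-product readout: the active feature scores ~1, inactive features
# pick up only small interference terms from non-orthogonal directions.
scores = directions @ activation
print(scores.argmax())  # recovers the active feature's index
```

Because distinct random unit directions are almost never exactly parallel, the active feature's score (exactly 1 here) dominates the interference on the inactive ones, which is why sparse features can be packed into fewer neurons and still be decoded.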

Counterfactual • Language Modelling +1
