Search Results for author: Aleksandar Makelov

Found 3 papers, 2 papers with code

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

1 code implementation28 Nov 2023 Aleksandar Makelov, Georg Lange, Neel Nanda

We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.

Attribute

Rethinking Backdoor Attacks

no code implementations19 Jul 2023 Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.

Backdoor Attack

Cannot find the paper you are looking for? You can Submit a new open access paper.