1 code implementation • 2 Dec 2023 • Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted.