no code implementations • 1 Feb 2024 • Dawn Lu, Nina Rimsky
We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model.
3 code implementations • 9 Dec 2023 • Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes.
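The summary above does not say how the activations are modified, so the following is only a minimal sketch of one way such steering could work. It assumes a steering vector is taken as the difference between the mean activations recorded for contrasting "positive" and "negative" prompts, and that this vector is added, scaled by a multiplier, to an activation during the forward pass. The names `mean_vector`, `steering_vector`, `add_steering`, and `multiplier` are hypothetical, and plain Python lists stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Assumed construction: mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def add_steering(activation: Vector, vector: Vector, multiplier: float = 1.0) -> Vector:
    """Add the scaled steering vector to one activation (hypothetical helper)."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy activations standing in for values captured at one layer of a model.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 1.0], [1.0, 1.0]]

vec = steering_vector(positive, negative)
steered = add_steering([0.5, 0.5], vec, multiplier=2.0)
```

In this toy setup a negative `multiplier` would push the activation in the opposite direction; in a real model the addition would have to happen inside the forward pass at a chosen layer, which this sketch does not attempt.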