Search Results for author: Nina Rimsky

Found 2 papers, 1 paper with code

Investigating Bias Representations in Llama 2 Chat via Activation Steering

no code implementations • 1 Feb 2024 • Dawn Lu, Nina Rimsky

We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model.

Steering Llama 2 via Contrastive Activation Addition

3 code implementations • 9 Dec 2023 • Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes.

Multiple-choice
