Search Results for author: Jeffrey Ladish

Found 5 papers, 2 papers with code

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

1 code implementation • 3 Jun 2024 • Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt

By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies.

Red Teaming

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

no code implementations • 31 Oct 2023 • Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Llama 2-Chat is a collection of large language models that Meta developed and released to the public.

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

no code implementations • 31 Oct 2023 • Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B and on the Mixtral instruct model.

Red Teaming, Safety Alignment
