Search Results for author: Simon Lermen

Found 4 papers, 1 paper with code

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

no code implementations • 26 Nov 2023 • Simon Lermen, Ondřej Kvapil

There has been increasing interest in evaluations of language models for a variety of risks and characteristics.

Natural Language Understanding

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

no code implementations • 31 Oct 2023 • Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks.

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

no code implementations • 31 Oct 2023 • Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Llama 2-Chat is a collection of large language models that Meta developed and released to the public.

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

1 code implementation • 3 Jul 2023 • Teun van der Weij, Simon Lermen, Leon Lang

Recently, there has been an increase in interest in evaluating large language models for emergent and dangerous capabilities.
