Search Results for author: Mateusz Dziemian

Found 2 papers, 1 paper with code

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

1 code implementation • 11 Oct 2024 • Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots.

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

no code implementations • 8 Oct 2024 • Simon Lermen, Mateusz Dziemian, Govind Pimpale

Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications.

Language Modeling
