1 code implementation • 11 Oct 2024 • Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots.
no code implementations • 8 Oct 2024 • Simon Lermen, Mateusz Dziemian, Govind Pimpale
Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications.