1 code implementation • 11 Oct 2024 • Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots.
2 code implementations • 1 Aug 2024 • Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use.
4 code implementations • 6 Jun 2024 • Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
Existing techniques aimed at improving alignment, such as refusal training, are often bypassed.
no code implementations • 11 Apr 2023 • Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou
In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language (a minimal sketch of this loop follows below).
Ranked #20 on Code Generation on MBPP
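The Self-Debugging loop described in the entry above could be sketched roughly as follows. This is a minimal illustration only, assuming a hypothetical `llm` callable that returns model completions as strings; the function names `run_code` and `self_debug` and the prompt wording are assumptions, not the paper's actual implementation.

```python
import subprocess
import sys
import tempfile


def run_code(code: str) -> str:
    """Execute candidate code in a subprocess and capture its output and errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=10
    )
    return result.stdout + result.stderr


def self_debug(llm, task: str, max_rounds: int = 3) -> str:
    """Iteratively generate, execute, explain, and refine code for `task`.

    `llm` is an assumed callable mapping a prompt string to a completion string.
    No human feedback or oracle error messages are used: the model inspects the
    execution output and its own explanation of the code.
    """
    code = llm(f"Write a Python program for the task:\n{task}")
    for _ in range(max_rounds):
        feedback = run_code(code)
        # "Rubber duck" step: the model explains the generated code in natural
        # language and compares its behavior against the task and execution output.
        explanation = llm(
            f"Explain this code line by line and compare its behavior to the task.\n"
            f"Task: {task}\nCode:\n{code}\nExecution output:\n{feedback}"
        )
        revised = llm(
            f"Given your explanation and the execution output, return a corrected "
            f"version of the code, or the same code if it is already correct.\n"
            f"Explanation:\n{explanation}\nCode:\n{code}"
        )
        if revised.strip() == code.strip():
            break  # the model judges the code correct; stop iterating
        code = revised
    return code
```

In this sketch the stopping criterion is simply that the model returns unchanged code; a practical variant could instead stop when unit tests pass, if tests are available.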