no code implementations • 23 Feb 2024 • Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs. Evaluating our method alongside several defense baselines, we demonstrate that it is the safest training-free defense against jailbreak attacks.
1 code implementation • 11 Dec 2023 • Heegyu Kim, Hyunsouk Cho
Our findings reveal that gated toxicity avoidance efficiently achieves toxicity reduction comparable to the original CTG methods while preserving the language model's generation performance.