Search Results for author: Heegyu Kim

Found 2 papers, 1 paper with code

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

no code implementations • 23 Feb 2024 • Heegyu Kim, Sehyun Yuk, Hyunsouk Cho

We propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned LMs and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks.

GTA: Gated Toxicity Avoidance for LM Performance Preservation

1 code implementation • 11 Dec 2023 • Heegyu Kim, Hyunsouk Cho

Our findings reveal that gated toxicity avoidance efficiently achieves comparable levels of toxicity reduction to the original CTG methods while preserving the generation performance of the language model.

GPT-4 Language Modelling +1
