no code implementations • 23 Feb 2024 • Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs. Evaluating our method alongside several defense baselines, we demonstrate that it is the safest training-free defense against jailbreak attacks.
1 code implementation • 11 Dec 2023 • Heegyu Kim, Hyunsouk Cho
Our findings reveal that gated toxicity avoidance efficiently achieves toxicity reduction comparable to the original CTG methods while preserving the language model's generation performance.