no code implementations • 17 Dec 2024 • Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, JunHao Chen, Aimin Yang, Yongmei Zhou
Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task.