no code implementations • 22 Mar 2024 • Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong
The robustness of such watermark-based detectors against evasion attacks in the white-box and black-box settings is well understood in the literature.
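A common watermark-based detector design in the literature decodes a bitstring from an image and flags the image as AI-generated when the bitwise accuracy against the ground-truth watermark exceeds a threshold. The sketch below illustrates that general idea only; the function name, threshold value `tau`, and toy bit vectors are illustrative assumptions, not this paper's method.

```python
import numpy as np

def detect_watermark(decoded_bits, ground_truth_bits, tau=0.8):
    """Flag an image as watermarked when the bitwise accuracy of the
    decoded bits against the ground-truth watermark reaches tau.
    (tau=0.8 is an illustrative choice, not taken from the paper.)"""
    decoded = np.asarray(decoded_bits)
    truth = np.asarray(ground_truth_bits)
    accuracy = float(np.mean(decoded == truth))
    return accuracy >= tau, accuracy

# Toy example: a 10-bit watermark where 9 of 10 decoded bits match.
flagged, acc = detect_watermark([1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
                                [1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
```

An evasion attack in this framing is a perturbation of the image that pushes the decoded bitstring's accuracy below `tau` while preserving visual quality.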
1 code implementation • 21 Feb 2024 • Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong
In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs.
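The general idea of gradient-based unsafe-prompt detection can be sketched as follows: compare the gradient a prompt induces on safety-critical parameters against a reference "unsafe" gradient direction, and flag the prompt when the two align. This is a minimal illustration with synthetic gradient vectors, not GradSafe's actual implementation; the cosine threshold and the way the reference direction is built are assumptions for the sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_unsafe(prompt_grad, unsafe_ref_grad, threshold=0.5):
    """Flag a prompt as unsafe when its gradient on the safety-critical
    parameters aligns with a reference unsafe-gradient direction.
    (threshold=0.5 is an illustrative choice.)"""
    return cosine(prompt_grad, unsafe_ref_grad) >= threshold

# Synthetic stand-ins for gradients on safety-critical parameters.
rng = np.random.default_rng(0)
ref = rng.normal(size=128)                  # reference unsafe-gradient direction
aligned = ref + 0.1 * rng.normal(size=128)  # gradient resembling the reference
benign = rng.normal(size=128)               # independent (near-orthogonal) gradient
```

In high dimensions, two independent random gradients are nearly orthogonal, so `benign` falls well below the threshold while `aligned` clears it.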
no code implementations • 3 Dec 2023 • Zonghao Huang, Neil Gong, Michael K. Reiter
Untrusted data used to train a model might have been manipulated to endow the learned model with hidden properties that the data contributor might later exploit.
1 code implementation • 20 May 2023 • Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones.