Search Results for author: Neil Gong

Found 4 papers, 2 papers with code

A Transfer Attack to Image Watermarks

no code implementations • 22 Mar 2024 Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong

The robustness of such watermark-based detectors against evasion attacks in the white-box and black-box settings is well understood in the literature.

GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis

1 code implementation • 21 Feb 2024 Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong

In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs.

Llama
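The gradient-analysis idea behind GradSafe can be illustrated with a minimal sketch: compare a prompt's gradient on safety-critical parameters against a reference gradient derived from known unsafe prompts, and flag the prompt when the two align. This is an assumption-laden toy illustration, not the authors' implementation; the synthetic gradients, the cosine-similarity criterion, and the threshold value are all placeholders.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flattened gradient vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_unsafe(prompt_grad, unsafe_reference_grad, threshold=0.5):
    # Flag a prompt as unsafe when its gradient w.r.t. safety-critical
    # parameters aligns with a reference gradient built from known unsafe
    # prompts. The threshold here is an illustrative assumption.
    return cosine_similarity(prompt_grad, unsafe_reference_grad) >= threshold

# Toy demo with synthetic gradients standing in for real LLM gradients:
rng = np.random.default_rng(0)
reference = rng.normal(size=128)                  # stand-in for an averaged unsafe-prompt gradient
aligned = reference + 0.1 * rng.normal(size=128)  # gradient of an unsafe-like prompt
orthogonal = rng.normal(size=128)                 # gradient of an unrelated (safe) prompt

print(is_unsafe(aligned, reference))     # high similarity -> flagged
print(is_unsafe(orthogonal, reference))  # near-zero similarity -> not flagged
```

In the actual method the gradients would come from an LLM's loss on the prompt, restricted to a small set of safety-critical parameters, rather than from random vectors as here.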

Mendata: A Framework to Purify Manipulated Training Data

no code implementations • 3 Dec 2023 Zonghao Huang, Neil Gong, Michael K. Reiter

Untrusted data used to train a model might have been manipulated to endow the learned model with hidden properties that the data contributor might later exploit.

Data Poisoning

SneakyPrompt: Jailbreaking Text-to-image Generative Models

1 code implementation • 20 May 2023 Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao

Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones.

Reinforcement Learning (RL), Semantic Similarity +1
