no code implementations • 4 Sep 2023 • Raz Lapid, Ron Langberg, Moshe Sipper
The genetic algorithm (GA) attack optimizes a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs.