More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM.
This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses.
Such stateful defenses aim to defend against black-box attacks by tracking the query history and detecting and rejecting queries that are "similar" and thus preventing black-box attacks from finding useful gradients and making progress towards finding adversarial attacks within a reasonable query budget.
D4 uses an ensemble of models over disjoint subsets of the frequency spectrum to significantly improve adversarial robustness.
The potential for fraudulent claims based on such generated 'fake' medical images is significant, and we demonstrate successful attacks on both X-rays and retinal fundus image modalities.
T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger.
AI-manipulated videos, commonly known as deepfakes, are an emerging problem.
We address three key requirements for practical attacks for the real-world: 1) automatically constraining the size and shape of the attack so it can be applied with stickers, 2) transform-robustness, i. e., robustness of a attack to environmental physical variations such as viewpoint and lighting changes, and 3) supporting attacks in not only white-box, but also black-box hard-label scenarios, so that the adversary can attack proprietary models.