ADD-Defense: Towards Defending Widespread Adversarial Examples via Perturbation-Invariant Representation

1 Jan 2021 · Dawei Zhou, Tongliang Liu, Bo Han, Nannan Wang, Xinbo Gao

Machine learning algorithms are vulnerable to adversarial examples, which makes defending against them challenging. Recently, various defenses have been proposed to mitigate the negative effects of adversarial examples generated by known attacks. However, these methods have obvious limitations against unknown attacks. Cognitive science suggests that the brain can recognize the same person under any expression by extracting invariant information from the face. Similarly, different adversarial examples share invariant information retained from the original examples. Motivated by this observation, we propose a defense framework, ADD-Defense, which extracts this invariant information, called the perturbation-invariant representation (PIR), to defend against widespread adversarial examples. Specifically, the PIR is realized by adversarial training with the additional ability to utilize perturbation-specific information, so it is invariant to known attacks and contains no perturbation-specific information. Facing the imbalance between widespread unknown attacks and limited known attacks, the PIR is expected to generalize well to unknown attacks by being matched to a Gaussian prior distribution. In this way, the PIR is invariant to both known and unknown attacks. Once the PIR is learned, we can generate an example without malicious perturbations as the output. We evaluate ADD-Defense against various pixel-constrained and spatially-constrained attacks, particularly BPDA and AutoAttack. The empirical results illustrate that ADD-Defense is robust to widespread adversarial examples.
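
No implementation is linked from this page, but the general idea the abstract describes, a representation that is adversarially stripped of perturbation-specific information and matched to a Gaussian prior before a purified example is decoded, can be sketched in training-loop form. The following is a minimal PyTorch sketch under my own assumptions, not the authors' code: the Encoder, Decoder, and AttackDiscriminator modules, the moment-matching prior loss, and all hyperparameters are illustrative placeholders.

# Hypothetical sketch of a PIR-style defense (not the authors' implementation).
# Assumed components: an encoder E (x -> z), a decoder G (z -> purified x),
# and an attack discriminator D (z -> attack type). All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=784, latent=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, latent))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, dim=784, latent=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, z):
        return self.net(z)

class AttackDiscriminator(nn.Module):
    # Tries to recover the attack type (e.g. clean / PGD / CW) from the representation.
    def __init__(self, latent=64, n_attacks=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, n_attacks))
    def forward(self, z):
        return self.net(z)

def prior_matching_loss(z):
    # Crude moment-matching proxy for aligning z with a standard Gaussian prior.
    mean_loss = z.mean(dim=0).pow(2).mean()
    var_loss = (z.var(dim=0) - 1.0).pow(2).mean()
    return mean_loss + var_loss

def training_step(E, G, D, opt_eg, opt_d, x_adv, x_clean, attack_label, lam=1.0, beta=0.1):
    # 1) Train the discriminator to identify the attack from z,
    #    i.e. to exploit any perturbation-specific information left in z.
    z = E(x_adv).detach()
    d_loss = F.cross_entropy(D(z), attack_label)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train encoder/decoder: reconstruct the clean example, fool the
    #    discriminator (so z becomes perturbation-invariant), and match z
    #    to the Gaussian prior for generalization to unseen attacks.
    z = E(x_adv)
    rec_loss = F.mse_loss(G(z), x_clean)
    adv_loss = -F.cross_entropy(D(z), attack_label)   # maximize discriminator error
    eg_loss = rec_loss + lam * adv_loss + beta * prior_matching_loss(z)
    opt_eg.zero_grad(); eg_loss.backward(); opt_eg.step()
    return d_loss.item(), eg_loss.item()

At inference time, under the same assumptions, a purified example would be produced simply as G(E(x_adv)) and passed to the downstream classifier; the loss weights lam and beta are made-up values and would need tuning.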
