The Manifold Assumption and Defenses Against Adversarial Perturbations

ICLR 2018  ·  Xi Wu, Uyeong Jang, Lingjiao Chen, Somesh Jha

In the adversarial-perturbation problem of neural networks, an adversary starts with a neural network model $F$ and a point $\bfx$ that $F$ classifies correctly, and applies a \emph{small perturbation} to $\bfx$ to produce another point $\bfx'$ that $F$ classifies \emph{incorrectly}. In this paper, we propose taking into account \emph{the inherent confidence information} produced by models when studying adversarial perturbations, where a natural measure of ``confidence'' is $\|F(\bfx)\|_\infty$ (i.e., how confident $F$ is in its prediction). Motivated by a thought experiment based on the manifold assumption, we propose a ``goodness property'' of models, which states that \emph{confident regions of a good model should be well separated}. We give formalizations of this property and examine existing robust training objectives in light of them. Interestingly, we find that a recent objective by Madry et al. encourages training a model that satisfies our formal version of the goodness property well, but exerts only weak control over points that are misclassified with low confidence. However, if Madry et al.'s model is indeed a good solution to their objective, then confident and uncertain points become distinguishable, and we can try to embed uncertain points back to the closest confident region to obtain (hopefully) correct predictions. We thus propose embedding objectives and algorithms, and perform an empirical study of this method. Our experimental results are encouraging: Madry et al.'s model wrapped with our embedding procedure achieves an almost perfect success rate in defending against attacks on which the base model fails, while retaining good generalization behavior.
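To make the two ingredients of the abstract concrete, the sketch below (not the authors' exact algorithm) shows the confidence measure $\|F(\bfx)\|_\infty$ as the maximum softmax probability, and a simple gradient-ascent "embedding" step that, for a low-confidence input, searches a small $\ell_\infty$ ball around it for a nearby high-confidence point and predicts there instead. The names and parameters (`model`, `tau`, `eps`, `step_size`, `n_steps`) are illustrative assumptions, and the paper's actual embedding objectives and algorithms may differ.

```python
# Minimal sketch, assuming a PyTorch classifier `model` that outputs logits.
# Confidence is ||F(x)||_inf, i.e. the largest softmax probability; a
# low-confidence input is nudged toward the nearest confident region by
# gradient ascent on confidence within a small L_inf ball around x.
import torch
import torch.nn.functional as F_nn


def confidence(model, x):
    """||F(x)||_inf: the largest softmax probability at x (single-example batch)."""
    with torch.no_grad():
        return F_nn.softmax(model(x), dim=1).max(dim=1).values


def embed_if_uncertain(model, x, tau=0.9, eps=0.3, step_size=0.01, n_steps=100):
    """If the model is unconfident at x, search a small L_inf ball around x
    for a nearby high-confidence point and predict there instead."""
    if confidence(model, x).item() >= tau:
        return model(x).argmax(dim=1)            # already confident: keep prediction

    x_emb = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        conf = F_nn.softmax(model(x_emb), dim=1).max(dim=1).values.sum()
        grad, = torch.autograd.grad(conf, x_emb)
        with torch.no_grad():
            x_emb += step_size * grad.sign()               # ascend confidence
            x_emb.copy_(x + (x_emb - x).clamp(-eps, eps))  # stay close to x
            x_emb.clamp_(0.0, 1.0)                         # keep a valid image
    return model(x_emb).argmax(dim=1)                      # predict at the embedded point
```

In this sketch the threshold `tau` plays the role of separating confident points from uncertain ones; in the paper, the proposed embedding objectives determine how uncertain points are mapped back to confident regions.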
