Understanding Adversarial Attacks on Autoencoders

1 Jan 2021 · Elsa Riachi, Frank Rudzicz

Adversarial vulnerability is a fundamental limitation of deep neural networks that remains poorly understood. Recent work suggests that adversarial attacks on deep neural network classifiers exploit the fact that non-robust models rely on superficial statistics to form predictions. While such features are semantically meaningless, they are strongly predictive of the input’s label, allowing non-robust networks to achieve good generalization on unperturbed test inputs. However, this hypothesis fails to explain why autoencoders are also vulnerable to adversarial attacks, despite achieving low reconstruction error on clean inputs. We show that training an autoencoder on adversarial input-target pairs leads to low reconstruction error on the standard test set, suggesting that adversarial attacks on autoencoders are predictive. In this work, we study the predictive power of adversarial examples on autoencoders through the lens of compressive sensing. We characterize the relationship between adversarial perturbations and target inputs and reveal that training autoencoders on adversarial input-target pairs is a form of knowledge distillation, achieved by learning to attenuate structured noise.
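
The experiment described in the abstract can be sketched in a few lines. The snippet below is a minimal PyTorch illustration, not the authors' code: the architecture, attack hyperparameters, and the random tensors standing in for a real dataset are all assumptions. An output-space attack perturbs a clean input so that a trained autoencoder reconstructs a chosen target, and a fresh autoencoder is then trained on the resulting (adversarial input, target) pairs and evaluated on clean inputs.

```python
# Minimal sketch of the experiment described above: craft output-space
# adversarial examples for a trained autoencoder, then fit a fresh
# autoencoder on (adversarial input, target) pairs.
# Assumptions: PyTorch, a toy MLP autoencoder, and random tensors in place
# of a real dataset; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AE(nn.Module):
    def __init__(self, dim=784, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, dim), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))


def attack(ae, x, x_target, eps=0.1, steps=40, lr=1e-2):
    """Find a small perturbation delta so that ae(x + delta) matches x_target."""
    for p in ae.parameters():          # freeze the victim autoencoder
        p.requires_grad_(False)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(ae(x + delta), x_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)    # keep the perturbation small (L_inf ball)
    return (x + delta).detach()


# Toy stand-ins for clean inputs and the attack targets.
x_clean, x_target = torch.rand(256, 784), torch.rand(256, 784)

victim = AE()                          # assume this was pretrained on clean data
x_adv = attack(victim, x_clean, x_target)

# Train a fresh autoencoder on adversarial input-target pairs.
student = AE()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(200):
    loss = F.mse_loss(student(x_adv), x_target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The paper's observation: a model trained this way still achieves low
# reconstruction error on clean test inputs.
x_test = torch.rand(64, 784)
print("clean test reconstruction MSE:", F.mse_loss(student(x_test), x_test).item())
```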
