Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspective

Adversarial training (AT) is the most commonly used mechanism to improve the robustness of deep neural networks. Recently, a novel adversarial attack against intermediate layers exploits the extra fragility of adversarially trained networks to output incorrect predictions. The result implies the insufficiency in the searching space of the adversarial perturbation in adversarial training. To straighten out the reason for the effectiveness of the intermediate-layer attack, we interpret the forward propagation as the Clustering Effect, characterizing that the intermediate-layer representations of neural networks for samples i.i.d. to the training set with the same label are similar, and we theoretically prove the existence of Clustering Effect by corresponding Information Bottleneck Theory. We afterward observe that the intermediate-layer attack disobeys the clustering effect of the AT-trained model. Inspired by these significant observations, we propose a regularization method to extend the perturbation searching space during training, named sufficient adversarial training (SAT). We give a proven robustness bound of neural networks through rigorous mathematical proof. The experimental evaluations manifest the superiority of SAT over other state-of-the-art AT mechanisms in defending against adversarial attacks against both output and intermediate layers. Our code and Appendix can be found at https://github.com/clustering-effect/SAT.

PDF Abstract


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here