Defending against black-box adversarial attacks with gradient-free trained sign activation neural networks

1 Jan 2021 · Yunzhe Xue, Meiyan Xie, Zhibo Yang, Usman Roshan

Binary neural networks have been shown to be more adversarially robust than full-precision networks, but their improvements are marginal. We propose an ensemble of sign activation neural networks trained with a novel gradient-free stochastic coordinate descent algorithm. On the CIFAR10 image benchmark we show that our model has a much higher minimum distortion, as measured by the black-box boundary attack HopSkipJump, than ensembles of binary, full-precision, and convolutional neural networks and than random forests, while attaining comparable clean test accuracy. We also show that our ensemble attains higher accuracy on adversarial examples in black-box text attacks across several datasets and is also hard to attack on medical ECG data. To explain our model's robustness we turn to non-transferability: we empirically measure the probability that a black-box adversary targeting a single network in our ensemble transfers to the other networks, and find it to be lower than for ensembles of other networks and models. An image therefore requires greater distortion to fool the majority of networks in our ensemble. The non-transferability in our ensemble also makes it a powerful defense against substitute-model black-box attacks, which we show require a much greater distortion to bring our model to zero adversarial accuracy than they do for binary and full-precision networks. This non-transferability arises naturally from the non-convexity of sign networks and the randomization in our training algorithm, without any explicit adversarial defense effort.
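The abstract describes the approach only at a high level: member networks use sign activations, are trained by a gradient-free stochastic coordinate descent over their weights, and the ensemble predicts by majority vote. The sketch below illustrates that idea under stated assumptions; it is not the authors' exact algorithm. In particular, the single hidden layer, the ±1 weights, the class name `SignNet`, the flip-one-weight-and-keep-it-if-training-accuracy-does-not-drop acceptance rule, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the authors' exact algorithm): a single-hidden-layer
# sign-activation network with +/-1 weights, trained by gradient-free
# stochastic coordinate descent, and a majority-vote ensemble of such nets.

def sign(z):
    # sign activation; map 0 to +1 so outputs stay in {-1, +1}
    return np.where(z >= 0, 1.0, -1.0)

class SignNet:
    def __init__(self, n_in, n_hidden, rng):
        self.W = rng.choice([-1.0, 1.0], size=(n_hidden, n_in))  # hidden weights
        self.v = rng.choice([-1.0, 1.0], size=n_hidden)          # output weights
        self.rng = rng

    def predict(self, X):
        # hidden sign layer followed by a sign output unit
        return sign(sign(X @ self.W.T) @ self.v)

    def accuracy(self, X, y):
        return np.mean(self.predict(X) == y)

    def fit(self, X, y, n_iters=5000):
        # stochastic coordinate descent: flip one randomly chosen weight,
        # keep the flip only if training accuracy does not decrease
        best = self.accuracy(X, y)
        for _ in range(n_iters):
            if self.rng.random() < 0.5:
                i = self.rng.integers(self.W.shape[0])
                j = self.rng.integers(self.W.shape[1])
                self.W[i, j] *= -1
                acc = self.accuracy(X, y)
                if acc >= best:
                    best = acc          # keep the flip
                else:
                    self.W[i, j] *= -1  # revert
            else:
                k = self.rng.integers(self.v.shape[0])
                self.v[k] *= -1
                acc = self.accuracy(X, y)
                if acc >= best:
                    best = acc
                else:
                    self.v[k] *= -1
        return self

def ensemble_predict(nets, X):
    # majority vote over the ensemble's +/-1 predictions
    votes = np.sum([net.predict(X) for net in nets], axis=0)
    return sign(votes)

# Usage: train a small ensemble on toy binary-labeled data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = sign(X[:, 0] * X[:, 1])  # toy target in {-1, +1}
nets = [SignNet(20, 16, np.random.default_rng(s)).fit(X, y) for s in range(5)]
print("ensemble train accuracy:", np.mean(ensemble_predict(nets, X) == y))
```

Each member is trained from a different random seed, so the randomized coordinate updates drive the networks toward different local solutions; the paper attributes the low transfer rate of attacks between members to this randomization together with the non-convexity of sign networks.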
