NETWORK ROBUSTNESS TO PCA PERTURBATIONS
A key challenge in analyzing neural networks' robustness is identifying the input features to whose perturbation a network is robust. Existing work focuses on direct perturbations to the inputs and thereby studies network robustness only with respect to the lowest-level features. In this work, we take a new approach and study the robustness of networks to the inputs' semantic features. We present a black-box approach to determine the features for which a network is robust or weak. We leverage these features to obtain provably robust neighborhoods, defined using robust features, and adversarial examples, defined by perturbing weak features. We evaluate our approach with PCA features. We show that (1) our provably robust neighborhoods are larger than the standard neighborhoods, on average by 1.8x and up to 4.5x, and (2) our adversarial examples are generated using at least 8.7x fewer queries and have at least 2.8x lower L2 distortion compared to the state of the art. We further show that our attack is effective even against ensemble adversarial training.
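To make the notion of perturbing a PCA feature concrete, the following is a minimal sketch, not the procedure evaluated in this work. It assumes PCA is fit with a plain SVD on flattened inputs, and the names `fit_pca`, `pca_coefficients`, and `perturb_pca_feature` are hypothetical helpers introduced only for illustration. It shows that shifting an input along one principal direction changes exactly one PCA coefficient and incurs an L2 distortion equal to the size of the shift.

```python
import numpy as np


def fit_pca(data, n_components):
    """Return the data mean and the top principal directions (one per row)."""
    mean = data.mean(axis=0)
    # SVD of the centered data; the rows of vt are orthonormal directions
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_coefficients(x, mean, components):
    """Project an input onto the principal directions."""
    return (x - mean) @ components.T


def perturb_pca_feature(x, components, index, delta):
    """Shift x by delta along a single principal direction (illustrative only)."""
    return x + delta * components[index]


# Synthetic stand-in for a dataset of flattened inputs (an assumption).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))

mean, components = fit_pca(data, n_components=4)
x = data[0]
x_perturbed = perturb_pca_feature(x, components, index=2, delta=0.5)

# Only the chosen coefficient moves; the L2 distortion equals |delta|.
coeff_shift = pca_coefficients(x_perturbed, mean, components) - pca_coefficients(x, mean, components)
l2_distortion = np.linalg.norm(x_perturbed - x)
```

Because the principal directions are orthonormal, a perturbation expressed in PCA coordinates has the same L2 norm in input space, which is why distortion can be reasoned about directly in terms of feature shifts in this sketch.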