Lower Bounds on the Robustness of Fixed Feature Extractors to Test-time Adversaries

29 Sep 2021 · Arjun Nitin Bhagoji, Daniel Cullina, Ben Zhao

Understanding the robustness of machine learning models to adversarial examples generated by test-time adversaries is a problem of great interest. Recent theoretical work has derived lower bounds on how robust \emph{any model} can be when a data distribution and attacker constraints are specified. However, these bounds hold over the set of all possible classification functions and do not account for the specific architectures and models used in practice, such as neural networks. In this paper, we develop a methodology to analyze the robustness of fixed feature extractors, which in turn yields bounds on the robustness of any classifier trained on top of them. In other words, our bounds indicate how robust the representation obtained from a given extractor is with respect to a given adversary. The bounds hold for arbitrary feature extractors, and their tightness depends on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding, while for arbitrary feature extractors, we propose a bespoke algorithm based on the iterative solution of a convex program that provably finds collisions. We use our bounds to identify the layers of robustly trained models that contribute the most to a lack of robustness, and to compare the same layer across different training methods for a quantitative comparison of their relative robustness. Our experiments establish that each of the following leads to a measurable drop in robustness: i) layers that linearly reduce dimension, ii) sparsity induced by ReLU activations, and iii) mismatches between the attacker constraints at train and test time. These findings point towards future design considerations for robust models arising from our methodology.
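To make the collision-finding idea concrete, here is a minimal sketch for the linear case: given a feature extractor f(x) = Wx and an l_infinity-bounded adversary, a collision is a pair of in-budget perturbations that map two different inputs to (nearly) the same feature vector. This casts the search as a small convex program in cvxpy; it is an illustrative assumption, not the paper's closed-form construction, and the names W, x1, x2, and eps are placeholders.

```python
# Illustrative collision search for a linear feature extractor f(x) = W x.
# NOT the paper's closed-form method: this checks whether two eps-bounded
# perturbations can make the features of x1 and x2 coincide.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, k = 20, 5                         # input dim, feature dim (k < d)
W = rng.standard_normal((k, d))      # fixed linear feature extractor
x1 = rng.standard_normal(d)          # two examples to collide
x2 = rng.standard_normal(d)
eps = 0.5                            # per-coordinate (l_inf) budget

d1, d2 = cp.Variable(d), cp.Variable(d)
# Drive the two perturbed feature vectors together.
residual = W @ (x1 + d1) - W @ (x2 + d2)
prob = cp.Problem(
    cp.Minimize(cp.norm(residual, 2)),
    [cp.norm(d1, "inf") <= eps, cp.norm(d2, "inf") <= eps],
)
prob.solve()

# A (numerically) zero residual is a collision: no classifier built on top
# of W can separate x1 + d1 from x2 + d2, which lower-bounds the robustness
# achievable with this feature extractor.
print("feature-space residual:", prob.value)
```

Since k < d, W has a nontrivial null space, so collisions of this kind typically exist even for modest eps; per the abstract, the paper handles nonlinear extractors with an iteratively solved convex program rather than this one-shot formulation.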
