Revisiting Adversarial Robustness of Classifiers With a Reject Option

Adversarial training of deep neural networks (DNNs) is an important defense mechanism that makes a DNN robust to input perturbations that can otherwise cause prediction errors. Recently, there has been growing interest in learning a classifier with a reject (abstain) option, which can be more robust to adversarial perturbations by declining to return a prediction on inputs where the classifier is likely to be incorrect. A challenge in robustly learning a classifier with a reject option is that existing works lack a mechanism to ensure that (very) small perturbations of an input are not rejected when they can in fact be accepted and correctly classified. We first propose a novel metric, robust error with rejection, that extends the standard definition of robust error to account for the rejection of small perturbations. The proposed metric has natural connections to the standard robust error (without rejection), as well as to the robust error with rejection proposed in a recent work. Motivated by this metric, we propose novel loss functions and a robust training method, stratified adversarial training with rejection (SATR), for a classifier with a reject option; the goal is to accept and correctly classify small input perturbations, while allowing the rejection of larger input perturbations that cannot be correctly classified. Experiments on well-known image classification DNNs using strong adaptive attack methods validate that SATR significantly improves the robustness of a classifier with rejection compared to standard adversarial training (with confidence-based rejection) as well as a recently proposed baseline.
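The abstract describes the metric informally: an example counts against the classifier either when an accepted perturbation is misclassified, or when a small perturbation that could have been correctly classified is rejected instead, while larger perturbations may be rejected without penalty. Below is a minimal sketch of one possible reading of such a metric using confidence-thresholded rejection; the threshold, the "small perturbation" radius, and the toy data are hypothetical placeholders, not the paper's exact formulation or evaluation protocol.

```python
# Illustrative sketch (not the paper's exact definition) of a
# "robust error with rejection" style metric for a classifier that
# abstains via a softmax-confidence threshold.
import numpy as np

rng = np.random.default_rng(0)


def predict_with_reject(logits, conf_threshold=0.7):
    """Return (predicted class, reject flag) from logits using softmax confidence.

    `conf_threshold` is a hypothetical rejection threshold, not a value from the paper.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    reject = probs.max(axis=1) < conf_threshold
    return preds, reject


def robust_error_with_rejection(logits_pert, labels, pert_norms, small_radius=0.05):
    """Count an example as an error if either
       (a) it is accepted but misclassified, or
       (b) its perturbation is 'small' (norm <= small_radius) yet it is rejected.
    Rejecting a large perturbation is not penalized in this sketch."""
    preds, reject = predict_with_reject(logits_pert)
    accepted_wrong = (~reject) & (preds != labels)
    small_rejected = reject & (pert_norms <= small_radius)
    return float((accepted_wrong | small_rejected).mean())


# Toy data standing in for perturbed-input logits and perturbation norms.
labels = rng.integers(0, 10, size=100)
logits_pert = rng.normal(size=(100, 10))
pert_norms = rng.uniform(0.0, 0.3, size=100)

print(robust_error_with_rejection(logits_pert, labels, pert_norms))
```

Setting `small_radius` to zero recovers a standard robust-error style count in which any rejection of an unperturbed-equivalent input is penalized, which is one way to see the connection to robust error without rejection mentioned in the abstract.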
