Evaluation Methodology for Attacks Against Confidence Thresholding Models

ICLR 2019  ·  Ian Goodfellow, Yao Qin, David Berthelot ·

Current machine learning algorithms can be easily fooled by adversarial examples. One possible solution path is to make models that use confidence thresholding to avoid making mistakes. Such models refuse to make a prediction when they are not confident of their answer. We propose to evaluate such models in terms of tradeoff curves with the goal of high success rate on clean examples and low failure rate on adversarial examples. Existing untargeted attacks developed for models that do not use confidence thresholding tend to underestimate such models' vulnerability. We propose the MaxConfidence family of attacks, which are optimal in a variety of theoretical settings, including one realistic setting: attacks against linear models. Experiments show the attack attains good results in practice. We show that simple defenses are able to perform well on MNIST but not on CIFAR, contributing further to previous calls that MNIST should be retired as a benchmarking dataset for adversarial robustness research. We release code for these evaluations as part of the cleverhans (Papernot et al 2018) library (ICLR reviewers should be careful not to look at who contributed these features to cleverhans to avoid de-anonymizing this submission).

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here