Empirical robustification of pre-trained classifiers

Most pre-trained classifiers, though they may work extremely well on the domain they were trained upon, are not trained in a robust fashion, and therefore are sensitive to adversarial attacks. A recent technique, denoised-smoothing, demonstrated that it was possible to create certifiably robust classifiers from a pre-trained classifier (without any retraining) by pre-pending a denoising network and wrapping the entire pipeline within randomized smoothing. However, this is a costly procedure, which requires multiple queries due to the randomized smoothing element, and which ultimately is very dependent on the quality of the denoiser. In this paper, we demonstrate that a more conventional “adversarial training” approach also works when applied to this robustification process. Specifically, we show that by training an image-to-image translation model, prepended to a pre-trained classifier, with losses that optimize for both the fidelity of the image reconstruction and the adversarial performance of the end-to-end system, we can robustify pre-trained classifiers to a higher empirical degree of accuracy than denoised smoothing. Further, these robustifers are also transferable to some degree across multiple classifiers and even some architectures, illustrating that in some real sense they are removing the “adversarial manifold” from the input data, a task that has traditionally been very challenging for “conventional” preprocessing methods.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here