Defuse: Debugging Classifiers Through Distilling Unrestricted Adversarial Examples
With the proliferation of machine learning models, the need to diagnose and correct bugs in models has become increasingly clear. As a route to better discovering and fixing model bugs, we propose failure scenarios: regions on the data manifold that a model classifies incorrectly. We introduce an end-to-end debugging framework called Defuse that uses these regions to fix faulty classifier predictions. Defuse works in three steps. First, it identifies many unrestricted adversarial examples—naturally occurring instances that are misclassified—using a generative model. Next, it distills the misclassified data using clustering; this step both reveals sources of model error and facilitates labeling. Last, it corrects model behavior on the distilled scenarios through an optimization-based approach. We illustrate the utility of our framework on a variety of image datasets and find that Defuse identifies and resolves concerning predictions while maintaining model generalization.
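The three-step pipeline can be sketched end to end on a toy problem. Everything below is an illustrative stand-in, not the paper's actual models: a fixed linear map plays the generative model, a hand-written k-means distills the failures, and plain logistic-regression gradient steps play the optimization-based correction, with an oracle labeling function standing in for a human annotator.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Generative model": a fixed linear decoder from 2-D latents to 2-D data
# (a stand-in for the paper's learned generative model).
W_dec = np.eye(2)

def decode(z):
    return z @ W_dec

# Classifier: logistic regression with weights w, deliberately misaligned
# with the ground-truth oracle below so that failures exist.
w = np.array([1.0, -0.2])

def predict(x, w):
    return (x @ w > 0).astype(int)

def oracle_label(x):
    # Ground-truth labels (stands in for a human annotator in Defuse's step 2).
    return (x[:, 0] + x[:, 1] > 0).astype(int)

# Step 1: sample latents, decode them, and keep the instances the
# classifier gets wrong -- the "unrestricted adversarial examples".
z = rng.normal(size=(2000, 2))
x = decode(z)
y_true = oracle_label(x)
y_pred = predict(x, w)
failures = x[y_pred != y_true]

# Step 2: distill the failures into a few clusters (minimal k-means),
# so each cluster can be inspected and labeled as one failure scenario.
def kmeans(points, k, iters=20):
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None] - centers[None], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = points[assign == j].mean(axis=0)
    return centers, assign

centers, assign = kmeans(failures, k=2)

# Step 3: correct the classifier with gradient steps on the labeled failure
# clusters, mixing in ordinary data to preserve generalization.
def finetune(w, x_fail, y_fail, x_ok, y_ok, lr=0.1, steps=200):
    X = np.vstack([x_fail, x_ok])
    Y = np.concatenate([y_fail, y_ok])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid probabilities
        w = w - lr * X.T @ (p - Y) / len(Y)    # logistic-loss gradient step
    return w

ok = y_pred == y_true
w_fixed = finetune(w, failures, oracle_label(failures),
                   x[ok][:200], y_true[ok][:200])

err_before = (predict(x, w) != y_true).mean()
err_after = (predict(x, w_fixed) != y_true).mean()
print(err_before, err_after)  # error on the sampled data should drop
```

The key design point the sketch captures is that the correction step trains on failure data *and* held-out correct data together; fine-tuning on the failure clusters alone would risk degrading accuracy elsewhere on the manifold.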