Robust Black Box Explanations Under Distribution Shift

As machine learning black boxes are increasingly being deployed in real-world applications, there has been a growing interest in developing post hoc explanations that summarize the behaviors of these black box models. However, existing algorithms for generating such explanations have been shown to lack robustness with respect to shifts in the underlying data distribution. In this paper, we propose a novel framework for generating robust explanations of black box models based on adversarial training. In particular, our framework optimizes a minimax objective that aims to construct the highest fidelity explanation with respect to the worst-case over a set of distribution shifts. We instantiate this algorithm for explanations in the form of linear models and decision sets by devising the required optimization procedures. To the best of our knowledge, this work makes the first attempt at generating post hoc explanations that are robust to a general class of distribution shifts that are of practical interest. Experimental evaluation with real-world and synthetic datasets demonstrates that our approach substantially improves the robustness of explanations without sacrificing their fidelity on the original data distribution.
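The minimax idea described above can be sketched in code. The following is a minimal illustration only, not the paper's actual algorithm: it assumes a simple ℓ∞-box family of input shifts as the "set of distribution shifts", approximates the inner maximization with a few signed-gradient steps, and fits a linear explanation by gradient descent on the worst-case squared fidelity loss. All function names, the shift family, and the hyperparameters are hypothetical choices for the sketch.

```python
import numpy as np

def fit_robust_linear_explanation(X, black_box, eps=0.5, steps=100,
                                  lr=0.05, inner_steps=10):
    """Illustrative minimax fit (not the paper's exact procedure):
    find linear weights (w, b) minimizing worst-case fidelity loss
    mean((Xs @ w + b - black_box(Xs))**2) over shifted inputs
    Xs = X + delta with ||delta||_inf <= eps."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        # Inner maximization: approximate the worst-case shift of each
        # point with a few signed-gradient ascent steps, projected back
        # into the eps-box around the original point. The gradient of
        # the black box itself is ignored here for simplicity.
        Xs = X.copy()
        for _ in range(inner_steps):
            resid = Xs @ w + b - black_box(Xs)
            grad = 2.0 * resid[:, None] * w[None, :]  # d(resid^2)/dXs, f frozen
            Xs = X + np.clip(Xs + 0.1 * np.sign(grad) - X, -eps, eps)
        # Outer minimization: one gradient step on (w, b) evaluated at
        # the (approximately) worst-case points.
        resid = Xs @ w + b - black_box(Xs)
        w -= lr * (2.0 / n) * (Xs.T @ resid)
        b -= lr * (2.0 / n) * resid.sum()
    return w, b
```

On a toy black box, e.g. `black_box = lambda Z: Z @ np.array([1.0, 2.0]) + 0.5`, the fitted `(w, b)` tracks the black box closely on the original data while the alternating updates keep the fidelity loss low at the shifted points as well; the decision-set instantiation mentioned in the abstract would require a discrete search in place of the gradient step.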

ICML 2020
