Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

Ascertaining, prior to deployment, that a deep network does not rely on an unknown spurious signal as the basis for its output is crucial in high-stakes settings like healthcare. While many post hoc explanation methods have been shown to be useful for some end tasks, theoretical and empirical evidence has also accumulated showing that these methods may not be faithful or useful. This leaves little guidance for a practitioner or researcher using these methods in their decision process. To address this gap, we investigate whether three classes of post hoc explanations (feature attribution, concept activation, and training point ranking) can alert a practitioner to a model’s reliance on unknown spurious signals. We test them in two medical domains with plausible spurious signals. In a broad experimental sweep across datasets, models, and spurious signals, we find that the post hoc explanations tested can be used to identify a model’s reliance on a spurious signal, provided that the practitioner using the explanation method knows the spurious signal ahead of time. Otherwise, a search over possible spurious signals and available data is required. This finding casts doubt on the utility of these approaches, in the hands of a practitioner, for detecting a model’s reliance on spurious signals.
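
To make the distinction between a known and an unknown spurious signal concrete, the following is a minimal toy sketch and not the paper's experimental setup. It assumes synthetic tabular data, a logistic regression in place of a deep network, and gradient-times-input as the feature attribution; all function names, feature indices, and parameters are hypothetical. The point it illustrates is that the check only works because the index of the spurious feature is passed in by the person running it.

```python
import numpy as np

def make_data(n, spurious_agreement, rng):
    """Toy data: column 0 is a weak true signal, column 1 is a
    (hypothetical) spurious feature, columns 2-4 are pure noise."""
    y = rng.integers(0, 2, size=n)
    sign = 2 * y - 1
    true_feat = 0.5 * sign + rng.normal(size=n)
    # The spurious feature matches the label with the given probability.
    agree = rng.random(n) < spurious_agreement
    spurious = 2.0 * np.where(agree, sign, -sign) + 0.1 * rng.normal(size=n)
    noise = rng.normal(size=(n, 3))
    return np.column_stack([true_feat, spurious, noise]), y

def fit_logreg(X, y, steps=2000, lr=0.1):
    """Plain gradient descent on the logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def attribution_share(w, X, feature_idx):
    """Mean fraction of |gradient * input| mass on one feature.
    For a linear logit, the input gradient is simply w."""
    attr = np.abs(X * w)
    return float((attr[:, feature_idx] / attr.sum(axis=1)).mean())

rng = np.random.default_rng(0)
SPURIOUS_IDX = 1  # assumption: the practitioner already knows where to look

X_s, y_s = make_data(2000, spurious_agreement=0.95, rng=rng)
X_c, y_c = make_data(2000, spurious_agreement=0.50, rng=rng)
w_s, _ = fit_logreg(X_s, y_s)
w_c, _ = fit_logreg(X_c, y_c)

share_spurious_model = attribution_share(w_s, X_s, SPURIOUS_IDX)
share_clean_model = attribution_share(w_c, X_c, SPURIOUS_IDX)
print(share_spurious_model, share_clean_model)
```

In this sketch, comparing the two shares flags the model trained on the correlated data, but only because `SPURIOUS_IDX` is supplied. Without that knowledge, one would have to repeat the comparison over every candidate feature, which corresponds to the search over possible spurious signals that the abstract describes.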
