Learning to Deceive with Attention-Based Explanations

ACL 2020 · Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, Zachary C. Lipton

Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question by demonstrating a simple method for training models to produce deceptive attention masks. Our method diminishes the total weight assigned to designated impermissible tokens, even when the models can be shown to nevertheless rely on these features to drive predictions. Across multiple models and tasks, our approach manipulates attention weights while paying surprisingly little cost in accuracy. Through a human study, we show that our manipulated attention-based explanations deceive people into thinking that predictions from a model biased against gender minorities do not rely on the gender. Consequently, our results cast doubt on attention's reliability as a tool for auditing algorithms in the context of fairness and accountability.
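
The abstract describes the method only at a high level: training is modified so that the total attention weight on designated impermissible tokens shrinks. The following is a minimal sketch of one way such an objective could look, not the authors' implementation. It assumes a penalty of the form -lam * log(1 - mass), where mass is the attention assigned to impermissible tokens; the function names, the `lam` coefficient, and the `eps` clamp are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def impermissible_mass(attention, mask):
    """Total attention weight on tokens flagged as impermissible (mask == 1)."""
    return sum(a for a, m in zip(attention, mask) if m)


def attention_penalty(attention, mask, lam=1.0, eps=1e-12):
    """Assumed penalty: grows as more attention lands on impermissible tokens.

    `lam` (penalty strength) and `eps` (clamp to avoid log(0)) are
    hypothetical parameters chosen for this sketch.
    """
    mass = impermissible_mass(attention, mask)
    return -lam * math.log(max(1.0 - mass, eps))


def total_loss(task_loss, attention, mask, lam=1.0):
    """Task loss plus the attention penalty; minimizing it pushes mass down."""
    return task_loss + attention_penalty(attention, mask, lam)


# Example: three tokens, the first one is impermissible.
weights = softmax([2.0, 0.5, 0.1])
mask = [1, 0, 0]
loss = total_loss(task_loss=0.3, attention=weights, mask=mask, lam=1.0)
```

Under this assumed objective, a model can lower the penalty by moving attention away from the flagged tokens without necessarily ceasing to use them, which is the failure mode the abstract warns about.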

Code

danishpruthi/deceptive-attention official

MatPrst/deceptive-attention-reprodu…

MatPrst/FACT

Tasks

Fairness

Datasets

SST

Reproducibility Reports

Jan 31 2021

[Re] Reproducing Learning to Deceive With Attention-Based Explanations

RC 2020 · Andrew Harrison, Rahel Habacker, Ard Snijders, Mathias Parisot

We reproduced the authors' results across all models and all available datasets, confirming their findings that attention-based explanations can be manipulated and that models can learn to deceive. We also replicated their BERT results using our reimplemented model. Only one result was less strongly (by > 1 S.D.) in the authors' experimental direction.
