[Re] Reproducibility study - Does enforcing diversity in hidden states of LSTM-Attention models improve transparency?

RC 2020 · Frank Verhoef, Pieter Bouwman, Yun Li, Rogier van der Weerd

Reproduction study for "Towards Transparent and Explainable Attention Models"

Scope of Reproducibility
For this reproducibility study, we focus on the main claims made in the original paper:
• The attention weights in standard LSTM attention models do not provide faithful and plausible explanations for the models' predictions. This is potentially because the conicity of the LSTM hidden vectors is high, meaning that the vectors cluster tightly around their mean direction.
• Two methods can be applied to reduce conicity: Orthogonalization and Diversity Driven Training. When applying these methods, the resulting attention weights offer more faithful and plausible explanations of the model's predictions, without sacrificing model performance.
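To make the notion of conicity concrete, the following is a minimal sketch in plain Python. It assumes the usual definition: the mean cosine similarity between each hidden vector and the mean of all hidden vectors in the sequence. The function names (`cosine`, `conicity`, `orthogonalize_against`) are hypothetical and are not taken from the authors' codebase. The projection step is only an illustration of the idea behind Orthogonalization, assuming that a new hidden state has its component along the mean of the previous hidden states removed; it is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def conicity(vectors):
    """Mean cosine similarity between each vector and the mean vector.

    Values near 1 mean the vectors lie in a narrow cone around their mean;
    lower values mean they are more spread out.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(cosine(v, mean) for v in vectors) / n


def orthogonalize_against(h, m):
    """Remove from h its component along m (assumed: m is the mean of earlier states)."""
    mm = sum(a * a for a in m)
    if mm == 0.0:
        return list(h)
    scale = sum(a * b for a, b in zip(h, m)) / mm
    return [a - scale * b for a, b in zip(h, m)]


# Hidden states pointing in nearly the same direction give high conicity.
narrow = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
# Mutually orthogonal hidden states give lower conicity.
spread = [[1.0, 0.0], [0.0, 1.0]]
print(conicity(narrow), conicity(spread))
```

Under the same assumption, Diversity Driven Training can be read as adding a conicity penalty, weighted by a hyperparameter, to the task loss, so that the optimizer is discouraged from producing hidden states that all point in the same direction.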

Methodology
The paper includes a link to a repository with the code used to generate its results. We follow four investigative routes: (i) Replication: we rerun experiments on datasets from the paper to replicate the results, and add the results that are missing from the paper; (ii) Code review: we scrutinize the code to validate its correctness; (iii) Evaluation methodology: we extend the set of evaluation metrics used in the paper with the LIME method, in an attempt to resolve inconclusive results; (iv) Generalization to other architectures: we test whether the authors' claims apply to variations of the base model (more complex forms of attention and a BiLSTM encoder).

Results
We confirm that the Orthogonal and Diversity LSTM achieve accuracies similar to those of the Vanilla LSTM, while lowering conicity. However, we cannot reproduce the results of several of the experiments in the paper that underlie the authors' claim of better transparency. In addition, a close inspection of the code base reveals some potentially problematic inconsistencies. Despite this, under certain conditions, we do confirm that the Orthogonal and Diversity LSTM can be useful methods to increase transparency. How to formulate these conditions more generally remains unclear and deserves further research. The single input sequence tasks appear to benefit most from the methods. For these tasks, the attention mechanism does not play a critical role in achieving good performance.

What was easy/difficult
The authors' codebase is accessible and can be run easily, with good facilities to prepare datasets and define configurations. The Orthogonalization and Diversity Training methods are well explained in the paper and mostly cleanly implemented. The larger datasets (Amazon and CNN) are difficult to run due to memory requirements and compute times. The codebase can be hard to navigate, a consequence of the choice to accommodate a wide variety of models and datasets in one framework.

Communication with original authors
We reached out to the authors about a fundamental but unexplained choice in the model architecture, but unfortunately did not hear back before the deadline of our assignment.