Achieving Explainability in a Visual Hard Attention Model through Content Prediction

1 Jan 2021  ·  Samrudhdhi Bharatkumar Rangrej, James J. Clark

A visual hard attention model actively selects and observes a sequence of subregions in an image to make a prediction. Unlike a deep convolutional network, a hard attention model makes it clear which regions of the image contributed to the prediction. However, the attention policy the model uses to select these regions is not explainable, and a suboptimal parameterization of the policy leads to a suboptimal model. In this paper, we attempt to design an efficient hard attention model for the image classification task. The attention policy used by our model is non-parametric and explainable. The model estimates the expected information gain (EIG) of attending to various regions by predicting their content ahead of time. It compares EIGs using Bayesian Optimal Experiment Design and attends to the region with the maximum EIG. The model aggregates features of the observed regions in a recurrent state, which is used by a classifier and by a variational autoencoder with normalizing flows to predict the class label and the image content, respectively. We train our model with a differentiable objective optimized using gradient descent and test it on several datasets. The performance of our model is comparable to or better than that of the baseline models.
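To make the selection rule concrete, the sketch below illustrates EIG-based glimpse selection of the kind the abstract describes: each candidate region is scored by the expected reduction in class-label entropy if its (predicted) content were observed, and the region with the highest score is attended next. This is a minimal illustration under assumptions, not the authors' implementation; the functions sample_content and classify_with are hypothetical stand-ins for the paper's content predictor (the VAE with normalizing flows) and the recurrent-state classifier, and the candidate locations and class count are arbitrary.

    import numpy as np

    def entropy(p, eps=1e-12):
        """Shannon entropy of a categorical distribution p(y)."""
        return -np.sum(p * np.log(p + eps), axis=-1)

    def expected_information_gain(prior_probs, candidates, sample_content,
                                  classify_with, n_samples=8):
        """Score each candidate glimpse location g by
            EIG(g) = H[p(y | o)] - E_{x_g ~ p(x_g | o)}[ H[p(y | o, x_g)] ],
        i.e. the expected drop in class-label entropy if region g were observed.
        sample_content and classify_with stand in for the learned content
        predictor and classifier (hypothetical interfaces)."""
        prior_H = entropy(prior_probs)
        scores = []
        for g in candidates:
            # Average the posterior entropy over imagined contents of region g.
            post_H = np.mean([entropy(classify_with(g, sample_content(g)))
                              for _ in range(n_samples)])
            scores.append(prior_H - post_H)
        return np.array(scores)

    # Toy usage with random stand-ins for the learned modules.
    rng = np.random.default_rng(0)
    candidates = [(0, 0), (0, 8), (8, 0), (8, 8)]   # corners of candidate glimpses
    sample_content = lambda g: rng.normal(size=(8, 8))
    classify_with = lambda g, x: (lambda p: p / p.sum())(rng.random(10))
    prior = np.full(10, 0.1)                        # current belief over 10 classes
    scores = expected_information_gain(prior, candidates, sample_content, classify_with)
    next_glimpse = candidates[int(np.argmax(scores))]

Because the scores are computed from predicted content rather than from a learned policy network, this selection step itself has no trainable parameters, which is what makes the attention policy non-parametric and inspectable.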

