Localizing Interpretable Multi-scale informative Patches Derived from Media Classification Task

Deep convolutional neural networks (CNNs) typically rely on wider receptive fields (RFs) and more complex non-linearities to achieve state-of-the-art performance, which makes it increasingly difficult to interpret how relevant patches contribute to the final prediction. In this paper, we construct an interpretable AnchorNet, equipped with carefully designed RFs and linear spatial aggregation, that provides patch-wise interpretability of the input media while localizing multi-scale informative patches. It is supervised only by media-level labels, without any extra bounding-box annotations. Visualizations of the localized informative image and text patches show the superior multi-scale localization capability of AnchorNet. We further use the localized patches for downstream classification tasks across widely used networks. Experimental results demonstrate that replacing the original inputs with their patches for classification yields a clear inference speedup with only slight performance degradation, which indicates that the localized patches retain most of the semantics and evidence of the original inputs.
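To illustrate why linear spatial aggregation makes patch contributions directly readable, the following is a minimal NumPy sketch. It is not the AnchorNet implementation: it assumes plain averaging as the linear aggregation, and the function name `aggregate_and_localize`, the `(H, W, C)` score layout, and the top-`k` selection rule are hypothetical choices made for illustration.

```python
import numpy as np


def aggregate_and_localize(patch_logits, k=3):
    """Aggregate per-patch class scores linearly and pick the top-k patches.

    patch_logits: array of shape (H, W, C), one score vector per patch
    (hypothetical layout; each position stands for one receptive field).
    """
    h, w, c = patch_logits.shape

    # Assumed linear aggregation: the media-level logits are the plain
    # average of the patch-level logits.
    media_logits = patch_logits.reshape(-1, c).mean(axis=0)
    pred = int(np.argmax(media_logits))

    # Because the aggregation is linear, each patch's contribution to the
    # predicted class is its own logit divided by the number of patches,
    # and the contributions sum exactly to the media-level logit.
    contrib = patch_logits[:, :, pred] / (h * w)

    # Rank patches by contribution and keep the k most informative ones.
    order = np.argsort(contrib.ravel())[::-1][:k]
    coords = [tuple(int(v) for v in np.unravel_index(i, (h, w))) for i in order]
    return pred, media_logits, contrib, coords


if __name__ == "__main__":
    scores = np.zeros((4, 4, 2))
    scores[1, 2, 1] = 8.0  # a strongly informative patch for class 1
    scores[3, 0, 1] = 4.0  # a weaker one
    pred, media_logits, contrib, coords = aggregate_and_localize(scores, k=2)
    print(pred, coords)
```

Under this assumption the selected coordinates could be used to crop the corresponding input regions and feed only those crops to a downstream classifier, which is the kind of input replacement the abstract describes.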
