
Spatial and Channel-wise Attention-based Convolutional Neural Network

Introduced by Chen et al. in SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning

Because CNN features are naturally spatial, channel-wise, and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for image captioning and uses an encoder-decoder framework in which a CNN first encodes an input image into a vector and an LSTM then decodes that vector into a sequence of words. Given an input feature map $X$ and the LSTM hidden state from the previous time step, $h_{t-1} \in \mathbb{R}^d$, the spatial attention mechanism pays more attention to semantically useful regions, guided by $h_{t-1}$. The spatial attention model is:

\begin{align} a(h_{t-1}, X) &= \tanh(\text{Conv}_1^{1 \times 1}(X) \oplus W_1 h_{t-1}) \\ \Phi_s(h_{t-1}, X) &= \text{Softmax}(\text{Conv}_2^{1 \times 1}(a(h_{t-1}, X))) \end{align}
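A minimal NumPy sketch of this spatial attention, where the $1 \times 1$ convolutions are written as per-location matrix multiplications; the function and weight names, shapes, and the hidden size $k$ are illustrative assumptions, not the paper's exact layers:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x, h_prev, conv1_w, w1, conv2_w):
    """Sketch of SCA-CNN spatial attention.

    x: (C, H, W) feature map; h_prev: (d,) previous LSTM hidden state.
    conv1_w: (k, C) and conv2_w: (1, k) play the roles of the 1x1 convs;
    w1: (k, d) is W_1. All shapes are assumptions for illustration.
    """
    c, hgt, wid = x.shape
    flat = x.reshape(c, -1)                               # (C, H*W): each column is one location
    a = np.tanh(conv1_w @ flat + (w1 @ h_prev)[:, None])  # ⊕: add the vector to every column
    logits = (conv2_w @ a).ravel()                        # (H*W,): one score per location
    alpha = softmax(logits)                               # Φ_s: weights sum to 1 over locations
    return alpha.reshape(hgt, wid)
```

The softmax is taken over all $H \times W$ locations, so the returned map is a distribution over image regions.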

where $\oplus$ denotes broadcast addition of a matrix and a vector. Similarly, channel-wise attention first aggregates global information via global average pooling (GAP), and then computes a channel-wise attention weight vector using the hidden state $h_{t-1}$:

\begin{align} b(h_{t-1}, X) &= \tanh((W_2 \text{GAP}(X) + b_2) \oplus W_1 h_{t-1}) \\ \Phi_c(h_{t-1}, X) &= \text{Softmax}(W_3 b(h_{t-1}, X) + b_3) \end{align}

Overall, the SCA mechanism can be composed in one of two orders. If channel-wise attention is applied before spatial attention:

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X \Phi_c(h_{t-1}, X)), \Phi_c(h_{t-1}, X)) \end{align}

and if spatial attention comes first:

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X), \Phi_c(h_{t-1}, X \Phi_s(h_{t-1}, X))) \end{align}

where $f(\cdot)$ denotes the modulating function, which takes the feature map $X$ and the attention maps as input and outputs the modulated feature map $Y$.
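The channel-wise attention and the modulating function $f$ can be sketched the same way; here $f$ is taken to be element-wise reweighting by both attention maps, and all weight names and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def channel_attention(x, h_prev, w2, b2, w1, w3, b3):
    """Sketch of SCA-CNN channel-wise attention.

    x: (C, H, W); h_prev: (d,).
    w2: (k, C), b2: (k,), w1: (k, d), w3: (C, k), b3: (C,) — assumed shapes.
    """
    gap = x.mean(axis=(1, 2))                  # GAP(X): one scalar per channel, (C,)
    b = np.tanh(w2 @ gap + b2 + w1 @ h_prev)   # hidden attention vector, (k,)
    return softmax(w3 @ b + b3)                # Φ_c: channel weights summing to 1

def modulate(x, alpha, beta):
    """One simple choice for f: element-wise reweighting by both maps.

    alpha: (H, W) spatial attention map; beta: (C,) channel weights.
    """
    return x * beta[:, None, None] * alpha[None, :, :]
```

In the channel-first order, `beta` would be computed on `x`, and `alpha` on the reweighted map `x * beta[:, None, None]`; in the spatial-first order the roles are swapped.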

Unlike previous attention mechanisms, which consider each image region equally and use global spatial information to tell the network where to focus, SCA-CNN leverages the semantic vector to produce both the spatial attention map and the channel-wise attention weight vector. Beyond being a powerful attention model, SCA-CNN also provides a better understanding of where and on what the model should focus during sentence generation.

Source: SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning

