Image Model Blocks

Split Attention

Introduced by Zhang et al. in ResNeSt: Split-Attention Networks

A Split Attention block enables attention across feature-map groups. As in ResNeXt blocks, the feature can be divided into several groups, and the number of feature-map groups is given by a cardinality hyperparameter $K$. The resulting feature-map groups are called cardinal groups. Split Attention blocks introduce a new radix hyperparameter $R$ that indicates the number of splits within a cardinal group, so the total number of feature groups is $G = KR$. We may apply a series of transformations {$\mathcal{F}_1, \mathcal{F}_2, \cdots\mathcal{F}_G$} to each individual group, then the intermediate representation of each group is $U_i = \mathcal{F}_i\left(X\right)$, for $i \in$ {$1, 2, \cdots{G}$}.

A combined representation for each cardinal group can be obtained by fusing via an element-wise summation across multiple splits. The representation for $k$-th cardinal group is $\hat{U}^k = \sum_{j=R(k-1)+1}^{R k} U_j $, where $\hat{U}^k \in \mathbb{R}^{H\times W\times C/K}$ for $k\in{1,2,...K}$, and $H$, $W$ and $C$ are the block output feature-map sizes. Global contextual information with embedded channel-wise statistics can be gathered with global average pooling across spatial dimensions $s^k\in\mathbb{R}^{C/K}$. Here the $c$-th component is calculated as:

$$ s^k_c = \frac{1}{H\times W} \sum_{i=1}^H\sum_{j=1}^W \hat{U}^k_c(i, j). $$

A weighted fusion of the cardinal group representation $V^k\in\mathbb{R}^{H\times W\times C/K}$ is aggregated using channel-wise soft attention, where each feature-map channel is produced using a weighted combination over splits. The $c$-th channel is calculated as:

$$ V^k_c=\sum_{i=1}^R a^k_i(c) U_{R(k-1)+i} , $$

where $a_i^k(c)$ denotes a (soft) assignment weight given by:

$$ a_i^k(c) = \begin{cases} \frac{exp(\mathcal{G}^c_i(s^k))}{\sum_{j=0}^R exp(\mathcal{G}^c_j(s^k))} & \quad\textrm{if } R>1, \ \frac{1}{1+exp(-\mathcal{G}^c_i(s^k))} & \quad\textrm{if } R=1,\ \end{cases} $$

and mapping $\mathcal{G}_i^c$ determines the weight of each split for the $c$-th channel based on the global context representation $s^k$.

Source: ResNeSt: Split-Attention Networks


Paper Code Results Date Stars