Attention Modules

Blender is a proposal-based instance mask generation module which incorporates rich instance-level information with accurate dense pixel features. A single convolution layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the blender crops the predicted bases with the instance's bounding box and linearly combines them according to the learned attention maps.

The inputs of the blender module are the bottom-level bases $\mathbf{B}$, the selected top-level attentions $A$, and the bounding box proposals $P$. First, RoIPool of Mask R-CNN is used to crop the bases with each proposal $\mathbf{p}_{d}$ and resize the cropped region to a fixed-size $R \times R$ feature map $\mathbf{r}_{d}$:

$$ \mathbf{r}_{d}=\operatorname{RoIPool}_{R \times R}\left(\mathbf{B}, \mathbf{p}_{d}\right), \quad \forall d \in\{1 \ldots D\} $$

More specifically, a sampling ratio of 1 is used for RoIAlign, i.e. one sampling point for each bin. During training, ground-truth boxes are used as the proposals. During inference, the FCOS prediction results are used.
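The crop-and-resize step can be illustrated with a minimal pure-Python sketch. It is not the reference implementation: the helper names `bilinear` and `roi_crop`, the `(x0, y0, x1, y1)` box layout, and the pixel-centre coordinate convention (pixel `j` covers `[j, j + 1)`) are all assumptions made for illustration. What it does take from the text is that each of the $R \times R$ bins is filled from a single bilinear sample at the bin centre (sampling ratio 1).

```python
import math

def bilinear(fmap, y, x):
    """Bilinearly sample a 2-D list `fmap` at index coordinates (y, x)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so that samples outside the map read the border value.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def roi_crop(bases, box, R):
    """Crop K basis maps with one box and resample each crop to R x R.

    `bases` is a list of K 2-D lists; `box` is (x0, y0, x1, y1) in
    continuous image coordinates (assumed convention). One sample is
    taken at the centre of each bin.
    """
    bx0, by0, bx1, by1 = box
    bin_w, bin_h = (bx1 - bx0) / R, (by1 - by0) / R
    regions = []
    for fmap in bases:
        region = []
        for i in range(R):
            # Bin centre in continuous coords, shifted by 0.5 to index coords.
            y = by0 + (i + 0.5) * bin_h - 0.5
            row = [bilinear(fmap, y, bx0 + (j + 0.5) * bin_w - 0.5)
                   for j in range(R)]
            region.append(row)
        regions.append(region)
    return regions
```

With a box that covers the whole map and $R$ equal to the map size, every bin centre coincides with a pixel centre, so the crop reproduces the input. Smaller $R$ values average neighbouring pixels through the bilinear weights.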

The attention size $M$ is smaller than $R$. We interpolate each $\mathbf{a}_{d}$ from $M \times M$ to $R \times R$ to match the shapes of the regions $R=\left\{\mathbf{r}_{d} \mid d=1 \ldots D\right\}$ (here $R$ denotes the set of cropped regions rather than the side length):

$$ \mathbf{a}_{d}^{\prime}=\operatorname{interpolate}_{M \times M \rightarrow R \times R}\left(\mathbf{a}_{d}\right), \quad \forall d \in\{1 \ldots D\} $$
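The text does not say which interpolation mode is used, so the sketch below assumes bilinear interpolation with a half-pixel centre convention. The function name `resize_bilinear` is hypothetical. It upsamples a single $M \times M$ attention map stored as a list of lists, and would be applied to each of the $K$ maps of $\mathbf{a}_{d}$ in turn.

```python
def resize_bilinear(att, R):
    """Resize one M x M map (list of lists) to R x R.

    Assumption: bilinear mode with half-pixel centres; source
    coordinates falling outside the map are clamped to the border.
    """
    M = len(att)
    scale = M / R

    def source_coord(i):
        # Centre of output pixel i, mapped into source index coordinates.
        c = min(max((i + 0.5) * scale - 0.5, 0.0), M - 1.0)
        lo = int(c)
        hi = min(lo + 1, M - 1)
        return lo, hi, c - lo

    out = []
    for i in range(R):
        y0, y1, wy = source_coord(i)
        row = []
        for j in range(R):
            x0, x1, wx = source_coord(j)
            top = att[y0][x0] * (1 - wx) + att[y0][x1] * wx
            bottom = att[y1][x0] * (1 - wx) + att[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

A constant map stays constant under this resize, and a horizontal ramp stays a ramp, which is a quick sanity check that the weights sum to one.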

Then $\mathbf{a}_{d}^{\prime}$ is normalized with a softmax function along the $K$ dimension to make it a set of score maps $\mathbf{s}_{d}$.

$$ \mathbf{s}_{d}=\operatorname{softmax}\left(\mathbf{a}_{d}^{\prime}\right), \quad \forall d \in\{1 \ldots D\} $$
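As an illustration of normalizing along the $K$ dimension, the following sketch applies a softmax independently at every spatial location, across the $K$ attention maps. The layout `att[k][i][j]` and the name `softmax_over_k` are assumptions made for this example.

```python
import math

def softmax_over_k(att):
    """Softmax across the K maps at each pixel.

    `att[k][i][j]` holds the k-th attention map; the returned scores
    have the same layout and sum to 1 over k at every (i, j).
    """
    K, H, W = len(att), len(att[0]), len(att[0][0])
    scores = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for i in range(H):
        for j in range(W):
            logits = [att[k][i][j] for k in range(K)]
            peak = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            for k in range(K):
                scores[k][i][j] = exps[k] / total
    return scores
```

Because the normalization is per pixel, each location gets its own convex combination of the $K$ bases in the blending step.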

Then we apply an element-wise product between each pair $\left(\mathbf{r}_{d}, \mathbf{s}_{d}\right)$ of the regions $R$ and scores $S$, and sum along the $K$ dimension to get the mask logit $\mathbf{m}_{d}$:

$$ \mathbf{m}_{d}=\sum_{k=1}^{K} \mathbf{s}_{d}^{k} \circ \mathbf{r}_{d}^{k}, \quad \forall d \in\{1 \ldots D\} $$

where $k$ is the index of the basis. The mask blending process with $K=4$ is visualized in the Figure.
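The blending sum itself is short enough to sketch directly. The function name `blend` and the `[k][i][j]` list layout are assumptions for illustration. A practical implementation would use a tensor library rather than nested lists.

```python
def blend(regions, scores):
    """Compute m[i][j] = sum_k scores[k][i][j] * regions[k][i][j].

    `regions` and `scores` are both K x R x R nested lists for a
    single instance d; the result is one R x R map of mask logits.
    """
    K = len(regions)
    H, W = len(regions[0]), len(regions[0][0])
    return [[sum(scores[k][i][j] * regions[k][i][j] for k in range(K))
             for j in range(W)]
            for i in range(H)]

# Two 2 x 2 bases blended with uniform per-pixel weights 0.25 and 0.75.
regions = [[[1.0, 1.0], [1.0, 1.0]],
           [[3.0, 3.0], [3.0, 3.0]]]
scores = [[[0.25, 0.25], [0.25, 0.25]],
          [[0.75, 0.75], [0.75, 0.75]]]
mask_logits = blend(regions, scores)  # every entry is 0.25 * 1 + 0.75 * 3 = 2.5
```

Since the scores sum to 1 over $k$ at each pixel, every mask logit is a per-pixel weighted average of the cropped bases.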

Source: BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation

