Blender is a proposal-based instance mask generation module which incorporates rich instance-level information with accurate dense pixel features. A single convolution layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the blender crops the predicted bases with its bounding box and linearly combines them according to the learned attention maps.
The inputs of the blender module are the bottom-level bases $\mathbf{B}$, the selected top-level attentions $A$ and the bounding box proposals $P$. First, RoIPool of Mask R-CNN is used to crop the bases with each proposal $\mathbf{p}_{d}$ and then resize the region to a fixed-size $R \times R$ feature map $\mathbf{r}_{d}$:
$$ \mathbf{r}_{d}=\operatorname{RoIPool}_{R \times R}\left(\mathbf{B}, \mathbf{p}_{d}\right), \quad \forall d \in\{1 \ldots D\} $$
More specifically, a sampling ratio of 1 is used for RoIAlign, i.e. one bin for each sampling point. During training, ground truth boxes are used as the proposals. During inference, FCOS prediction results are used.
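The cropping step can be illustrated with a minimal NumPy sketch. It is not the Mask R-CNN or BlendMask implementation: the function names `bilinear_sample` and `crop_bases` are hypothetical, the box is assumed to be given as `(x1, y1, x2, y2)` in the coordinates of the basis map, and pixel centers are assumed to sit at integer coordinates. Each of the $R \times R$ bins takes a single bilinear sample at its center, which is one reading of "sampling ratio 1".

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat of shape (K, H, W) at a continuous location (y, x)."""
    _, H, W = feat.shape
    # Clamp the location so that all four neighbours stay inside the map.
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1]
    bottom = (1 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]
    return (1 - wy) * top + wy * bottom

def crop_bases(bases, box, R):
    """Crop bases (K, H, W) with box (x1, y1, x2, y2) into a (K, R, R) region.

    Assumption: one bilinear sample per output bin, taken at the bin center.
    """
    x1, y1, x2, y2 = box
    bin_h, bin_w = (y2 - y1) / R, (x2 - x1) / R
    out = np.empty((bases.shape[0], R, R), dtype=float)
    for i in range(R):
        for j in range(R):
            cy = y1 + (i + 0.5) * bin_h  # bin center, vertical
            cx = x1 + (j + 0.5) * bin_w  # bin center, horizontal
            out[:, i, j] = bilinear_sample(bases, cy, cx)
    return out
```

For example, cropping bases of shape `(4, 8, 8)` with the box `(1, 1, 5, 5)` and `R = 2` returns a `(4, 2, 2)` region whose entries are sampled at the four bin centers.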
The attention size $M$ is smaller than $R$. We therefore interpolate each $\mathbf{a}_{d}$ from $M \times M$ to $R \times R$, to match the shape of the regions $R=\left(\mathbf{r}_{d} \mid d=1 \ldots D\right)$:
$$ \mathbf{a}_{d}^{\prime}=\operatorname{interpolate}_{M \times M \rightarrow R \times R}\left(\mathbf{a}_{d}\right), \quad \forall d \in\{1 \ldots D\} $$
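The text does not say which interpolation is used, so the following sketch assumes bilinear interpolation with half-pixel centers. The function name `interpolate_attention` and the sizes in the usage note are illustrative, not values from the source.

```python
import numpy as np

def interpolate_attention(a, R):
    """Resize an attention map a of shape (K, M, M) to (K, R, R).

    Assumption: separable linear (bilinear) interpolation with half-pixel centers.
    """
    _, M, _ = a.shape
    # Positions of the R target samples, expressed in source-pixel coordinates.
    pos = np.clip((np.arange(R) + 0.5) * M / R - 0.5, 0, M - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, M - 1)
    w = pos - lo
    # Interpolate along the row axis, then along the column axis.
    rows = a[:, lo, :] * (1 - w)[None, :, None] + a[:, hi, :] * w[None, :, None]
    return rows[:, :, lo] * (1 - w)[None, None, :] + rows[:, :, hi] * w[None, None, :]
```

Calling `interpolate_attention(a, 56)` on an attention map of shape `(4, 14, 14)` yields shape `(4, 56, 56)`. When $M = R$, the input is returned unchanged.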
Then $\mathbf{a}_{d}^{\prime}$ is normalized with a softmax function along the $K$ dimension to make it a set of score maps $\mathbf{s}_{d}$.
$$ \mathbf{s}_{d}=\operatorname{softmax}\left(\mathbf{a}_{d}^{\prime}\right), \quad \forall d \in\{1 \ldots D\} $$
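As a small illustration of the normalization, the sketch below applies a softmax over the basis axis of a `(K, R, R)` array, so that the $K$ scores at every pixel are positive and sum to one. The function name `softmax_over_bases` is hypothetical. Subtracting the maximum is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax_over_bases(a):
    """Softmax along axis 0 (the K bases) of an array of shape (K, R, R)."""
    z = a - a.max(axis=0, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```

With all-zero attentions every basis receives the same score $1/K$ at every pixel.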
Then we apply the element-wise product between each pair $\mathbf{r}_{d}, \mathbf{s}_{d}$ of the regions $R$ and scores $S$, and sum along the $K$ dimension to get the mask logit $\mathbf{m}_{d}$:
$$ \mathbf{m}_{d}=\sum_{k=1}^{K} \mathbf{s}_{d}^{k} \circ \mathbf{r}_{d}^{k}, \quad \forall d \in\{1 \ldots D\} $$
where $k$ is the index of the basis. The mask blending process with $K=4$ is visualized in the Figure.
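The blending equations can be sketched for a whole batch of $D$ instances in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the function name `blend` is hypothetical, the regions and attentions are assumed to be already cropped and interpolated to shape `(D, K, R, R)`, and `R = 56` is an arbitrary choice (only $K=4$ comes from the text).

```python
import numpy as np

def blend(regions, attentions):
    """Blend cropped bases with attention maps.

    regions:    (D, K, R, R) cropped bases r_d
    attentions: (D, K, R, R) interpolated attentions a'_d
    returns:    (D, R, R) mask logits m_d
    """
    # Softmax over the K axis (axis=1), shifted by the max for numerical stability.
    z = attentions - attentions.max(axis=1, keepdims=True)
    scores = np.exp(z)
    scores /= scores.sum(axis=1, keepdims=True)
    # Element-wise product with the regions, then sum over the K bases.
    return (scores * regions).sum(axis=1)

rng = np.random.default_rng(0)
D, K, R = 2, 4, 56  # R = 56 is an assumed size for illustration
regions = rng.standard_normal((D, K, R, R))
attentions = np.zeros((D, K, R, R))  # uniform attention: plain average of the bases
masks = blend(regions, attentions)
print(masks.shape)
```

With uniform attentions the mask logit reduces to the mean of the $K$ cropped bases. If one basis dominates the attention at a pixel, the logit there approaches the value of that basis, which is what lets the attention select or suppress bases per location.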
Source: BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation

Task  Papers  Share 

Novel View Synthesis  4  8.16% 
Semantic Segmentation  4  8.16% 
Optical Flow Estimation  3  6.12% 
Object Detection  3  6.12% 
Instance Segmentation  3  6.12% 
Neural Rendering  2  4.08% 
Computed Tomography (CT)  2  4.08% 
Fairness  2  4.08% 
Style Transfer  2  4.08% 
Component  Type 

RoIAlign  RoI Feature Extractors 
RoIPool  RoI Feature Extractors 
Softmax  Output Functions 