Scale-wise Feature Aggregation Module

Introduced by Zhao et al. in M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

SFAM, or Scale-wise Feature Aggregation Module, is a feature extraction block from the M2Det architecture. It aims to aggregate the multi-level multi-scale features generated by Thinned U-Shaped Modules into a multi-level feature pyramid.

The first stage of SFAM is to concatenate features of the same scale along the channel dimension. The aggregated feature pyramid can be represented as $\mathbf{X} =[\mathbf{X}_1,\mathbf{X}_2,\dots,\mathbf{X}_i]$, where $\mathbf{X}_i = \text{Concat}(\mathbf{x}_i^1,\mathbf{x}_i^2,\dots,\mathbf{x}_i^L) \in \mathbb{R}^{W_{i}\times H_{i}\times C}$ refers to the features of the $i$-th largest scale and $L$ is the number of levels. Each scale in the aggregated pyramid therefore contains features from every level depth.
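The first stage can be sketched with NumPy as below. This is a minimal illustration rather than the M2Det implementation: the function name `aggregate_by_scale`, the nested-list layout `level_pyramids[l][i]`, and the toy sizes (3 levels, 3 scales, 4 channels per level) are all assumptions made for the example.

```python
import numpy as np


def aggregate_by_scale(level_pyramids):
    """Concatenate same-scale feature maps from every level along channels.

    level_pyramids[l][i] is assumed to hold the feature map produced by
    level l at scale i, with shape (W_i, H_i, C_level).
    Returns a list with one array of shape (W_i, H_i, L * C_level) per scale.
    """
    num_scales = len(level_pyramids[0])
    return [
        np.concatenate([pyramid[i] for pyramid in level_pyramids], axis=-1)
        for i in range(num_scales)
    ]


# Hypothetical toy sizes, chosen only to keep the example small.
num_levels, channels_per_level = 3, 4
scale_sizes = [(8, 8), (4, 4), (2, 2)]

rng = np.random.default_rng(0)
level_pyramids = [
    [rng.standard_normal((w, h, channels_per_level)) for (w, h) in scale_sizes]
    for _ in range(num_levels)
]

X = aggregate_by_scale(level_pyramids)
shapes = [x.shape for x in X]
```

Each entry of `X` keeps the spatial size of its scale, while its channel count grows to the number of levels times the per-level channel count.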

However, simple concatenation is not adaptive enough. In the second stage, we introduce a channel-wise attention module to encourage features to focus on the channels from which they benefit most. Following Squeeze-and-Excitation, we use global average pooling to generate channel-wise statistics $\mathbf{z} \in \mathbb{R}^C$ at the squeeze step. To fully capture channel-wise dependencies, the following excitation step learns the attention mechanism via two fully connected layers:

$$ \mathbf{s} = \mathbf{F}_{ex}(\mathbf{z},\mathbf{W}) = \sigma(\mathbf{W}_{2} \delta(\mathbf{W}_{1}\mathbf{z})), $$

where $\sigma$ refers to the sigmoid function, $\delta$ refers to the ReLU function (as in Squeeze-and-Excitation, so that each entry of $\mathbf{s}$ is a gate in $(0,1)$), $\mathbf{W}_{1} \in \mathbb{R}^{\frac{C}{r}\times C}$, $\mathbf{W}_{2} \in \mathbb{R}^{C\times \frac{C}{r}}$, and $r$ is the reduction ratio ($r=16$ in our experiments). The final output is obtained by reweighting the input $\mathbf{X}$ with activation $\mathbf{s}$:

$$ \tilde{\mathbf{X}}_i^c = \mathbf{F}_{scale}(\mathbf{X}_i^c,s_c) = s_c \cdot \mathbf{X}_i^c, $$

where $\tilde{\mathbf{X}}_i = [\tilde{\mathbf{X}}_i^1,\tilde{\mathbf{X}}_i^2,\dots,\tilde{\mathbf{X}}_i^C]$, and each of the features is enhanced or weakened by the rescaling operation.
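The second stage can be sketched for a single scale as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function names, the random untrained weights, and the toy size $C=32$ are hypothetical, and the activations follow the Squeeze-and-Excitation convention of a ReLU after the first fully connected layer and a sigmoid gate after the second.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def se_reweight(X_i, W1, W2):
    """Channel-wise attention for one scale.

    X_i: aggregated features of shape (W_i, H_i, C)
    W1:  first fully connected layer, shape (C // r, C)
    W2:  second fully connected layer, shape (C, C // r)
    """
    z = X_i.mean(axis=(0, 1))         # squeeze: global average pooling -> (C,)
    hidden = np.maximum(W1 @ z, 0.0)  # excitation, first layer + ReLU -> (C // r,)
    s = sigmoid(W2 @ hidden)          # excitation, second layer + sigmoid -> (C,)
    return X_i * s, s                 # scale: s_c multiplies channel c everywhere


# Hypothetical toy sizes and random (untrained) weights.
rng = np.random.default_rng(0)
C, r = 32, 16
X_i = rng.standard_normal((4, 4, C))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1

X_tilde, s = se_reweight(X_i, W1, W2)
```

Because each $s_c$ lies strictly between 0 and 1 here, the rescaling in this sketch can only attenuate a channel relative to the others; the two layers together hold $2C^2/r$ weights, ignoring biases.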

Source: M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

Tasks


Task Papers Share
3D Feature Matching 1 12.50%
Document Understanding 1 12.50%
Text Detection 1 12.50%
Text Spotting 1 12.50%
3D Reconstruction 1 12.50%
Anomaly Detection 1 12.50%
Video Prediction 1 12.50%
Object Detection 1 12.50%
