ASFF, or Adaptively Spatial Feature Fusion, is a method for pyramidal feature fusion. It learns how to spatially filter conflicting information to suppress inconsistency across different feature scales, thus improving the scale-invariance of features.
ASFF enables the network to directly learn how to spatially filter features from other levels so that only useful information is kept for combination. For the features at a given level, features from the other levels are first resized to the same resolution and integrated, and the network is then trained to find the optimal fusion. At each spatial location, features from different levels are fused adaptively, i.e., some features may be filtered out because they carry contradictory information at that location, while others may dominate with more discriminative clues. ASFF offers several advantages: (1) since the operation of searching for the optimal fusion is differentiable, it can be conveniently learned through backpropagation; (2) it is agnostic to the backbone model and can be applied to single-shot detectors that have a feature pyramid structure; and (3) its implementation is simple and the added computational cost is marginal.
Let $\mathbf{x}_{ij}^{n\rightarrow l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:
$$ \mathbf{y}_{ij}^l = \alpha^l_{ij} \cdot \mathbf{x}_{ij}^{1\rightarrow l} + \beta^l_{ij} \cdot \mathbf{x}_{ij}^{2\rightarrow l} +\gamma^l_{ij} \cdot \mathbf{x}_{ij}^{3\rightarrow l}, $$
where $\mathbf{y}_{ij}^l$ denotes the $(i,j)$-th vector (across channels) of the output feature maps $\mathbf{y}^l$. $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ refer to the spatial importance weights for the feature maps from the three different levels to level $l$, which are adaptively learned by the network. Note that $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ can be simple scalar variables, shared across all channels. Inspired by ACNet, we force $\alpha^l_{ij}+\beta^l_{ij}+\gamma^l_{ij}=1$ and $\alpha^l_{ij},\beta^l_{ij},\gamma^l_{ij} \in [0,1]$, and
$$ \alpha^l_{ij} = \frac{e^{\lambda^l_{\alpha_{ij}}}}{e^{\lambda^l_{\alpha_{ij}}} + e^{\lambda^l_{\beta_{ij} }} + e^{\lambda^l_{\gamma_{ij}}}}. $$
Here $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ are defined by using the softmax function with $\lambda^l_{\alpha_{ij}}$, $\lambda^l_{\beta_{ij}}$ and $\lambda^l_{\gamma_{ij}}$ as control parameters respectively. We use $1\times1$ convolution layers to compute the weight scalar maps $\mathbf{\lambda}^l_\alpha$, $\mathbf{\lambda}^l_\beta$ and $\mathbf{\lambda}^l_\gamma$ from $\mathbf{x}^{1\rightarrow l}$, $\mathbf{x}^{2\rightarrow l}$ and $\mathbf{x}^{3\rightarrow l}$ respectively, and they can thus be learned through standard backpropagation.
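The fusion rule above can be sketched in a few lines. The following is a minimal NumPy sketch (not the authors' implementation): a $1\times1$ convolution with a single output channel reduces to a per-pixel dot product over channels, so the hypothetical weight vectors `w_a`, `w_b`, `w_g` stand in for the learned $1\times1$ kernels that produce $\lambda^l_\alpha$, $\lambda^l_\beta$ and $\lambda^l_\gamma$:

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def asff_fuse(x1, x2, x3, w_a, w_b, w_g):
    """Fuse three already-resized feature maps of shape (C, H, W).

    w_a, w_b, w_g are (C,) vectors standing in for the 1x1 convolutions
    (single output channel) that compute the scalar maps lambda_alpha,
    lambda_beta, lambda_gamma; names are illustrative, not from the paper's code.
    """
    # 1x1 conv with one output channel == per-pixel dot product over channels
    lam_a = np.tensordot(w_a, x1, axes=(0, 0))  # (H, W)
    lam_b = np.tensordot(w_b, x2, axes=(0, 0))
    lam_g = np.tensordot(w_g, x3, axes=(0, 0))
    # Softmax across the three levels enforces alpha + beta + gamma = 1
    # and alpha, beta, gamma in [0, 1] at every spatial location.
    alpha, beta, gamma = softmax(np.stack([lam_a, lam_b, lam_g]), axis=0)
    # Per-location weighted sum, broadcast over channels
    return alpha * x1 + beta * x2 + gamma * x3  # (C, H, W)
```

Because the weights sum to one at every location, feeding the same map in at all three levels returns that map unchanged; in training, the weight vectors would be learned jointly with the backbone by backpropagation.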
With this method, the features from all levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline as YOLOv3.
Source: Learning Spatial Fusion for Single-Shot Object Detection

| Component | Type |
| --- | --- |
| 1x1 Convolution | Convolutions |
| Convolution | Convolutions |
| Max Pooling | Pooling Operations |