ASFF, or Adaptively Spatial Feature Fusion, is a method for pyramidal feature fusion. It learns how to spatially filter conflicting information to suppress inconsistency across different feature scales, thus improving the scale-invariance of features.
ASFF enables the network to directly learn how to spatially filter features from other levels so that only useful information is kept for combination. For the features at a given level, features from the other levels are first resized to the same resolution and integrated, and the network is then trained to find the optimal fusion. At each spatial location, features from different levels are fused adaptively, i.e., some features may be filtered out because they carry contradictory information at that location, while others may dominate with more discriminative clues. ASFF offers several advantages: (1) since the operation of searching for the optimal fusion is differentiable, it can be conveniently learned through backpropagation; (2) it is agnostic to the backbone model and can be applied to single-shot detectors that have a feature pyramid structure; and (3) its implementation is simple and the added computational cost is marginal.
Let $\mathbf{x}_{ij}^{n\rightarrow l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:
$$ \mathbf{y}_{ij}^l = \alpha^l_{ij} \cdot \mathbf{x}_{ij}^{1\rightarrow l} + \beta^l_{ij} \cdot \mathbf{x}_{ij}^{2\rightarrow l} +\gamma^l_{ij} \cdot \mathbf{x}_{ij}^{3\rightarrow l}, $$
where $\mathbf{y}_{ij}^l$ denotes the $(i,j)$-th vector (across channels) of the output feature maps $\mathbf{y}^l$. $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ refer to the spatial importance weights for the feature maps from the three different levels to level $l$, which are adaptively learned by the network. Note that $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ can be simple scalar variables, shared across all channels. Inspired by ACNet, we force $\alpha^l_{ij}+\beta^l_{ij}+\gamma^l_{ij}=1$ and $\alpha^l_{ij},\beta^l_{ij},\gamma^l_{ij} \in [0,1]$, and
$$ \alpha^l_{ij} = \frac{e^{\lambda^l_{\alpha_{ij}}}}{e^{\lambda^l_{\alpha_{ij}}} + e^{\lambda^l_{\beta_{ij} }} + e^{\lambda^l_{\gamma_{ij}}}}. $$
Here $\alpha^l_{ij}$, $\beta^l_{ij}$ and $\gamma^l_{ij}$ are defined by using the softmax function with $\lambda^l_{\alpha_{ij}}$, $\lambda^l_{\beta_{ij}}$ and $\lambda^l_{\gamma_{ij}}$ as control parameters respectively. We use $1\times1$ convolution layers to compute the weight scalar maps $\mathbf{\lambda}^l_\alpha$, $\mathbf{\lambda}^l_\beta$ and $\mathbf{\lambda}^l_\gamma$ from $\mathbf{x}^{1\rightarrow l}$, $\mathbf{x}^{2\rightarrow l}$ and $\mathbf{x}^{3\rightarrow l}$ respectively, and they can thus be learned through standard backpropagation.
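The fusion rule above can be sketched in a few lines. The following is a minimal NumPy sketch (not the authors' implementation): a $1\times1$ convolution with a single output channel reduces to a per-pixel dot product over channels, so the hypothetical weight vectors `w_a`, `w_b`, `w_g` stand in for the learned $1\times1$ kernels that produce $\lambda^l_\alpha$, $\lambda^l_\beta$ and $\lambda^l_\gamma$:

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def asff_fuse(x1, x2, x3, w_a, w_b, w_g):
    """Fuse three already-resized feature maps of shape (C, H, W).

    w_a, w_b, w_g are (C,) vectors standing in for the 1x1 convolutions
    (single output channel) that compute the scalar maps lambda_alpha,
    lambda_beta, lambda_gamma; names are illustrative, not from the paper's code.
    """
    # 1x1 conv with one output channel == per-pixel dot product over channels
    lam_a = np.tensordot(w_a, x1, axes=(0, 0))  # (H, W)
    lam_b = np.tensordot(w_b, x2, axes=(0, 0))
    lam_g = np.tensordot(w_g, x3, axes=(0, 0))
    # Softmax across the three levels enforces alpha + beta + gamma = 1
    # and alpha, beta, gamma in [0, 1] at every spatial location.
    alpha, beta, gamma = softmax(np.stack([lam_a, lam_b, lam_g]), axis=0)
    # Per-location weighted sum, broadcast over channels
    return alpha * x1 + beta * x2 + gamma * x3  # (C, H, W)
```

Because the weights sum to one at every location, feeding the same map in at all three levels returns that map unchanged; in training, the weight vectors would be learned jointly with the backbone by backpropagation.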
With this method, the features from all levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline as YOLOv3.
Source: Learning Spatial Fusion for Single-Shot Object Detection

| Component | Type |
| --- | --- |
| 1x1 Convolution | Convolutions |
| Convolution | Convolutions |
| Max Pooling | Pooling Operations |