Attention-based Feature Aggregation

29 Sep 2021  ·  Xiongwei Wu, Ee-Peng Lim, Steven Hoi, Qianru Sun

Capturing object instances at different scales is a long-standing problem in visual recognition tasks such as object detection and instance segmentation. The conventional approach is to learn scale-invariant features, e.g., by summing up the feature maps output by different layers of the backbone. In this paper, we propose a novel, adaptive feature aggregation module based on attention, where the attention parameters are learned to handle different situations: aggregation of shallow layers is learned to be conservative to mitigate the effect of noisy pixels, while for deep layers it tends to be audacious so as to incorporate high-level semantics. To implement this module, we define two variants of attention: self-attention on the summed-up feature map, and cross-attention between two feature maps before they are summed up. The former uses the aggregated pixel values to capture global attention (improving the feature for the next layer of aggregation), while the latter allows attention-based interactions between the two features before aggregation. In addition, we apply multi-scale pooling in our attention module to reduce computational cost, and thus call the two variants Multi-Scale Self-Attention (MSSA) and Multi-Scale Cross-Attention (MSCA), respectively. We incorporate each variant into multiple baselines, e.g., the state-of-the-art object recognizer Cascade Mask-RCNN, and evaluate them on the MSCOCO and LVIS datasets. Results show significant improvements over the baselines, e.g., boosting Cascade Mask-RCNN by 2.2% AP^box and 2.7% AP^mask on MSCOCO.
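
The sketch below is a minimal, hypothetical illustration (not the authors' released code) of how attention-based feature aggregation with multi-scale pooling could be wired up in PyTorch. All names here (MultiScaleAttentionAggregation, pool_sizes, the choice of 1x1 convolutions for query/key/value, and the residual-style output) are assumptions made for clarity; the keys and values are drawn from multi-scale pooled tokens to keep the attention cost low, in the spirit of MSSA/MSCA.

```python
# Illustrative sketch only: the module and parameter names are assumptions,
# not the paper's implementation. Assumes PyTorch is available.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAttentionAggregation(nn.Module):
    """Aggregate two feature maps with attention whose keys/values come from
    multi-scale pooled tokens, reducing the cost of attending over all pixels."""

    def __init__(self, channels, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.pool_sizes = pool_sizes
        self.scale = channels ** -0.5

    def _pooled_tokens(self, feat):
        # Pool the feature map at several output sizes and flatten to tokens,
        # so attention runs over a short token sequence instead of H*W pixels.
        tokens = [F.adaptive_avg_pool2d(feat, s).flatten(2) for s in self.pool_sizes]
        return torch.cat(tokens, dim=2)  # (B, C, T)

    def forward(self, shallow, deep):
        # Cross-attention-style use: the shallow map queries tokens pooled from
        # the deep map before the two are summed. A self-attention-style use
        # would pass the already summed map as both arguments.
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        b, c, h, w = shallow.shape
        q = self.query(shallow).flatten(2).transpose(1, 2)       # (B, HW, C)
        kv = self._pooled_tokens(deep).unsqueeze(-1)             # (B, C, T, 1)
        k = self.key(kv).squeeze(-1)                             # (B, C, T)
        v = self.value(kv).squeeze(-1)                           # (B, C, T)
        attn = torch.softmax(q @ k * self.scale, dim=-1)         # (B, HW, T)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return shallow + deep + out  # attention-refined aggregation
```

Under these assumptions, the module is a drop-in replacement for the plain element-wise sum used in conventional feature pyramids: the attention term lets the network learn how strongly each location should draw on the other feature map before aggregation.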
