Augmenting Convolutional networks with attention-based aggregation

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how much each patch contributes to the classification decision. We pair this learned aggregation layer with a simple patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
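
The contrast between average pooling and attention-based aggregation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, omits the residual connection, normalization and feed-forward parts of a full transformer block, and uses random stand-ins for what would be learned parameters. The names `attention_pool`, `average_pool`, `query`, `w_key` and `w_value` are hypothetical and chosen for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def average_pool(patches):
    """Baseline: every patch contributes equally to the pooled vector."""
    count = len(patches)
    return [sum(column) / count for column in zip(*patches)]


def attention_pool(patches, query, w_key, w_value):
    """Single-query attention pooling (simplified, single head).

    One query vector scores every patch; the softmax of those scores
    decides how much each patch contributes to the pooled vector.
    """
    dim = len(query)
    keys = [matvec(w_key, p) for p in patches]
    values = [matvec(w_value, p) for p in patches]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(dim)
    ]
    return pooled, weights


def random_matrix(rng, rows, cols):
    """Random stand-in for a learned weight matrix or a feature map."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_patches = 8, 16                      # illustrative sizes only
patches = random_matrix(rng, num_patches, dim)
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
w_key = random_matrix(rng, dim, dim)
w_value = random_matrix(rng, dim, dim)

pooled, weights = attention_pool(patches, query, w_key, w_value)
```

The `weights` list holds one non-negative value per patch and sums to one, so it can be read as a per-patch contribution map. In this simplified form, a zero query gives uniform weights, and the layer then reduces to average pooling of the value vectors.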


Results from the Paper


Task                   Dataset       Model                           Metric Name       Metric Value  Global Rank
Semantic Segmentation  ADE20K        PatchConvNet-B60 (UperNet)      Validation mIoU   51.1          # 48
Semantic Segmentation  ADE20K        PatchConvNet-B120 (UperNet)     Validation mIoU   52.8          # 43
Semantic Segmentation  ADE20K        PatchConvNet-L120 (UperNet)     Validation mIoU   52.9          # 41
Semantic Segmentation  ADE20K        PatchConvNet-S60 (UperNet)      Validation mIoU   49.3          # 69
Semantic Segmentation  ADE20K val    PatchConvNet-B120 (UperNet)     mIoU              52.8          # 28
Semantic Segmentation  ADE20K val    PatchConvNet-B60 (UperNet)      mIoU              51.1          # 32
Semantic Segmentation  ADE20K val    PatchConvNet-S60 (UperNet)      mIoU              49.3          # 44
Semantic Segmentation  ADE20K val    PatchConvNet-L120 (UperNet)     mIoU              52.9          # 27
Object Detection       COCO minival  PatchConvNet-S120 (Mask R-CNN)  box AP            47.0          # 67
Object Detection       COCO minival  PatchConvNet-S60 (Mask R-CNN)   box AP            46.4          # 71
Image Classification   ImageNet      PatchConvNet-S60                Top 1 Accuracy    82.1%         # 350
Image Classification   ImageNet      PatchConvNet-S60                Number of params  25.2M         # 417
Image Classification   ImageNet      PatchConvNet-L120-21k-384       Top 1 Accuracy    87.1%         # 57
Image Classification   ImageNet      PatchConvNet-L120-21k-384       Number of params  334.3M        # 672
Image Classification   ImageNet      PatchConvNet-B60-21k-384        Top 1 Accuracy    86.5%         # 80
Image Classification   ImageNet      PatchConvNet-B60-21k-384        Number of params  99.4M         # 626
Image Classification   ImageNet      PatchConvNet-B120               Top 1 Accuracy    84.1%         # 198
Image Classification   ImageNet      PatchConvNet-B120               Number of params  188.6M        # 646
Image Classification   ImageNet      PatchConvNet-B60                Top 1 Accuracy    83.5%         # 243
Image Classification   ImageNet      PatchConvNet-B60                Number of params  99.4M         # 626
Image Classification   ImageNet      PatchConvNet-S120               Top 1 Accuracy    83.2%         # 262
Image Classification   ImageNet      PatchConvNet-S120               Number of params  47.7M         # 507
Image Classification   ImageNet      PatchConvNet-S60-21k-512        Top 1 Accuracy    85.4%         # 133
Image Classification   ImageNet      PatchConvNet-S60-21k-512        Number of params  25.2M         # 417

Methods