Augmenting Convolutional networks with attention-based aggregation

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how the patches are involved in the classification decision. We combine this learned aggregation layer with a simplistic patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation, and detection.
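To make the contrast with average pooling concrete, here is a minimal sketch of attention-based aggregation in plain Python. It is a simplification under stated assumptions, not the paper's layer: it uses a single head, one query vector standing in for a learned class token, and no projection matrices, normalization, or feed-forward part. The names `attention_pool`, `average_pool`, and `query` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patches, query):
    """Aggregate patch embeddings with attention weights.

    patches: list of N vectors of dimension d (one per image patch).
    query:   vector of dimension d (assumed stand-in for a learned class token).
    Returns the pooled vector and the per-patch weights.
    """
    d = len(query)
    # Scaled dot-product score between the query and each patch.
    scores = [
        sum(q * x for q, x in zip(query, patch)) / math.sqrt(d)
        for patch in patches
    ]
    weights = softmax(scores)
    # Weighted sum of the patch embeddings.
    pooled = [
        sum(w * patch[j] for w, patch in zip(weights, patches))
        for j in range(d)
    ]
    return pooled, weights


def average_pool(patches):
    """Plain average pooling: every patch gets weight 1/N."""
    n = len(patches)
    d = len(patches[0])
    return [sum(patch[j] for patch in patches) / n for j in range(d)]
```

With an all-zero query every score is equal, so the weights are uniform and the result coincides with average pooling. A non-zero query shifts weight toward the patches that align with it, and the returned weights can be read as a per-patch map of how much each patch contributed to the pooled representation.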


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | PatchConvNet-S60 (UperNet) | Validation mIoU | 49.3 | #127 |
| Semantic Segmentation | ADE20K | PatchConvNet-L120 (UperNet) | Validation mIoU | 52.9 | #78 |
| Semantic Segmentation | ADE20K | PatchConvNet-B120 (UperNet) | Validation mIoU | 52.8 | #80 |
| Semantic Segmentation | ADE20K | PatchConvNet-B60 (UperNet) | Validation mIoU | 51.1 | #93 |
| Semantic Segmentation | ADE20K val | PatchConvNet-L120 (UperNet) | mIoU | 52.9 | #38 |
| Semantic Segmentation | ADE20K val | PatchConvNet-B120 (UperNet) | mIoU | 52.8 | #39 |
| Semantic Segmentation | ADE20K val | PatchConvNet-B60 (UperNet) | mIoU | 51.1 | #43 |
| Semantic Segmentation | ADE20K val | PatchConvNet-S60 (UperNet) | mIoU | 49.3 | #55 |
| Object Detection | COCO minival | PatchConvNet-S120 (Mask R-CNN) | box AP | 47.0 | #93 |
| Object Detection | COCO minival | PatchConvNet-S60 (Mask R-CNN) | box AP | 46.4 | #97 |
| Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Top 1 Accuracy | 86.5% | #135 |
| Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Number of params | 99.4M | #866 |
| Image Classification | ImageNet | PatchConvNet-B120 | Top 1 Accuracy | 84.1% | #325 |
| Image Classification | ImageNet | PatchConvNet-B120 | Number of params | 188.6M | #888 |
| Image Classification | ImageNet | PatchConvNet-B60 | Top 1 Accuracy | 83.5% | #391 |
| Image Classification | ImageNet | PatchConvNet-B60 | Number of params | 99.4M | #866 |
| Image Classification | ImageNet | PatchConvNet-S120 | Top 1 Accuracy | 83.2% | #413 |
| Image Classification | ImageNet | PatchConvNet-S120 | Number of params | 47.7M | #712 |
| Image Classification | ImageNet | PatchConvNet-S60 | Top 1 Accuracy | 82.1% | #525 |
| Image Classification | ImageNet | PatchConvNet-S60 | Number of params | 25.2M | #593 |
| Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Top 1 Accuracy | 85.4% | #221 |
| Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Number of params | 25.2M | #593 |
| Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Top 1 Accuracy | 87.1% | #103 |
| Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Number of params | 334.3M | #920 |

Methods