MogaNet: Multi-order Gated Aggregation Network

By contextualizing the kernel as global as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D & 3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
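The abstract describes a module that combines convolutions with gated aggregation. The block below is a toy, dependency-free 1D sketch of that general idea, not the authors' module: a pointwise gate branch and a depthwise context branch with per-channel dilations are multiplied element-wise. The function names, the SiLU activation, the 1D setting, and the choice of dilations are all assumptions made for illustration; the repository linked above is the reference for the actual architecture.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v). Its use here is an assumption."""
    return v / (1.0 + math.exp(-v))


def depthwise_conv1d(signal, kernel, dilation=1):
    """Zero-padded, same-length 1D cross-correlation over a single channel."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - half + k * dilation
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def gated_aggregation(channels, gate_weights, kernels, dilations):
    """Hypothetical gated aggregation over C channels of length N.

    channels:     C lists of N floats.
    gate_weights: C x C matrix for the pointwise (1x1) gate projection.
    kernels:      one depthwise kernel per channel.
    dilations:    one dilation per channel, so channels see different context sizes.
    """
    num_channels = len(channels)
    length = len(channels[0])
    # Gate branch: mix channels at each position (a 1x1 convolution).
    gate = [
        [sum(gate_weights[o][c] * channels[c][i] for c in range(num_channels))
         for i in range(length)]
        for o in range(num_channels)
    ]
    # Context branch: each channel is filtered on its own (depthwise).
    context = [
        depthwise_conv1d(ch, k, d)
        for ch, k, d in zip(channels, kernels, dilations)
    ]
    # Aggregate: the activated gate modulates the activated context element-wise.
    return [
        [silu(g) * silu(c) for g, c in zip(gate_row, context_row)]
        for gate_row, context_row in zip(gate, context)
    ]


if __name__ == "__main__":
    x = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, 0.0, -1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    kernels = [[0.25, 0.5, 0.25], [1.0, 0.0, 1.0]]
    for row in gated_aggregation(x, identity, kernels, dilations=[1, 2]):
        print([round(v, 4) for v in row])
```

With an identity gate and a pass-through kernel the output reduces to the squared SiLU of the input, which gives a simple sanity check for the sketch.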


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | MogaNet-XL (UperNet) | Validation mIoU | 54 | #66 |
| Semantic Segmentation | ADE20K | MogaNet-L (UperNet) | Validation mIoU | 50.9 | #104 |
| | | | GFLOPs (512 x 512) | 1176 | #22 |
| Semantic Segmentation | ADE20K | MogaNet-S (Semantic FPN) | Validation mIoU | 47.7 | #155 |
| | | | GFLOPs (512 x 512) | 189 | #8 |
| Semantic Segmentation | ADE20K | MogaNet-S (UperNet) | Validation mIoU | 49.2 | #133 |
| | | | GFLOPs (512 x 512) | 946 | #14 |
| Semantic Segmentation | ADE20K | MogaNet-B (UperNet) | Validation mIoU | 50.1 | #117 |
| | | | GFLOPs (512 x 512) | 1050 | #18 |
| Object Detection | COCO 2017 val | MogaNet-L (Cascade Mask R-CNN) | AP | 53.3 | #10 |
| Object Detection | COCO 2017 val | MogaNet-XT (Mask R-CNN 1x) | AP | 40.7 | #30 |
| Object Detection | COCO 2017 val | MogaNet-T (RetinaNet 1x) | AP | 41.4 | #29 |
| Object Detection | COCO 2017 val | MogaNet-XT (RetinaNet 1x) | AP | 39.7 | #32 |
| Object Detection | COCO 2017 val | MogaNet-XL (Cascade Mask R-CNN) | AP | 56.2 | #9 |
| Object Detection | COCO 2017 val | MogaNet-S (RetinaNet 1x) | AP | 45.8 | #26 |
| Object Detection | COCO 2017 val | MogaNet-B (RetinaNet 1x) | AP | 47.7 | #23 |
| Object Detection | COCO 2017 val | MogaNet-L (RetinaNet 1x) | AP | 48.7 | #21 |
| Object Detection | COCO 2017 val | MogaNet-B (Mask R-CNN 1x) | AP | 47.9 | #22 |
| Object Detection | COCO 2017 val | MogaNet-S (Mask R-CNN 1x) | AP | 46.7 | #25 |
| Object Detection | COCO 2017 val | MogaNet-T (Mask R-CNN 1x) | AP | 42.6 | #28 |
| Object Detection | COCO 2017 val | MogaNet-L (Mask R-CNN 1x) | AP | 49.4 | #18 |
| Object Detection | COCO 2017 val | MogaNet-S (Cascade Mask R-CNN) | AP | 51.6 | #14 |
| Object Detection | COCO 2017 val | MogaNet-B (Cascade Mask R-CNN) | AP | 52.6 | #11 |
| Instance Segmentation | COCO test-dev | MogaNet-B (Cascade Mask R-CNN) | mask AP | 46 | #40 |
| Instance Segmentation | COCO test-dev | MogaNet-L (Mask R-CNN 1x) | mask AP | 44.1 | #47 |
| Instance Segmentation | COCO test-dev | MogaNet-S (Mask R-CNN 1x) | mask AP | 42.2 | #54 |
| Instance Segmentation | COCO test-dev | MogaNet-T (Mask R-CNN 1x) | mask AP | 39.1 | #82 |
| Instance Segmentation | COCO test-dev | MogaNet-XT | mask AP | 37.6 | #91 |
| Instance Segmentation | COCO test-dev | MogaNet-T | mask AP | 35.8 | #97 |
| Instance Segmentation | COCO test-dev | MogaNet-B (Mask R-CNN 1x) | mask AP | 43.2 | #50 |
| Instance Segmentation | COCO test-dev | MogaNet-XL (Cascade Mask R-CNN) | mask AP | 48.8 | #26 |
| Instance Segmentation | COCO test-dev | MogaNet-L (Cascade Mask R-CNN) | mask AP | 46.1 | #38 |
| Instance Segmentation | COCO test-dev | MogaNet-S (Cascade Mask R-CNN) | mask AP | 45.1 | #43 |
| Instance Segmentation | COCO val2017 | MogaNet-S (256x192) | AP50 | 90.7 | #1 |
| | | | AP75 | 82.8 | #1 |
| Pose Estimation | COCO val2017 | MogaNet-B (384x288) | AP | 77.3 | #2 |
| | | | AP50 | 91.4 | #3 |
| | | | AP75 | 84 | #3 |
| | | | AR | 82.2 | #1 |
| Pose Estimation | COCO val2017 | MogaNet-S (384x288) | AP | 76.4 | #4 |
| | | | AP50 | 91 | #4 |
| | | | AP75 | 83.3 | #4 |
| | | | AR | 81.4 | #2 |
| Pose Estimation | COCO val2017 | MogaNet-S (256x192) | AP | 74.9 | #8 |
| | | | AR | 80.1 | #6 |
| Pose Estimation | COCO val2017 | MogaNet-T (256x192) | AP | 73.2 | #9 |
| | | | AP50 | 90.1 | #6 |
| | | | AP75 | 81 | #6 |
| | | | AR | 78.8 | #7 |
| Image Classification | ImageNet | MogaNet-XL (384res) | Top 1 Accuracy | 87.8% | #68 |
| | | | Number of params | 181M | #960 |
| | | | GFLOPs | 102 | #499 |
| Image Classification | ImageNet | MogaNet-S | Top 1 Accuracy | 83.4% | #425 |
| | | | Number of params | 25M | #637 |
| | | | GFLOPs | 5 | #241 |
| Image Classification | ImageNet | MogaNet-T (256res) | Top 1 Accuracy | 80% | #722 |
| | | | Number of params | 5.2M | #459 |
| | | | GFLOPs | 1.44 | #133 |
| Image Classification | ImageNet | MogaNet-B | Top 1 Accuracy | 84.3% | #324 |
| | | | Number of params | 44M | #754 |
| | | | GFLOPs | 9.9 | #318 |
| Image Classification | ImageNet | MogaNet-L | Top 1 Accuracy | 84.7% | #294 |
| | | | Number of params | 83M | #878 |
| | | | GFLOPs | 15.9 | #373 |
| Image Classification | ImageNet | MogaNet-XT (256res) | Top 1 Accuracy | 77.2% | #880 |
| | | | Number of params | 3M | #415 |
| | | | GFLOPs | 1.04 | #108 |
| Video Prediction | Moving MNIST | MogaNet (SimVP 10x) | MSE | 15.67 | #4 |
| | | | MAE | 51.84 | #4 |
| | | | SSIM | 0.9661 | #4 |
| Video Prediction | Moving MNIST | Uniformer (SimVP 10x) | MSE | 18.01 | #9 |
| | | | MAE | 57.52 | #8 |
| Video Prediction | Moving MNIST | Swin (SimVP 10x) | MSE | 19.11 | #11 |
| | | | MAE | 59.84 | #9 |
| Video Prediction | Moving MNIST | ConvNeXt (SimVP 10x) | MSE | 17.58 | #7 |
| | | | MAE | 55.76 | #7 |
| | | | SSIM | 0.9617 | #9 |
| Video Prediction | Moving MNIST | ViT (SimVP 10x) | MSE | 19.74 | #12 |
| | | | MAE | 61.65 | #12 |
| | | | SSIM | 0.9539 | #11 |
| Video Prediction | Moving MNIST | MLP-Mixer (SimVP 10x) | MSE | 18.85 | #10 |
| | | | MAE | 59.86 | #10 |
| Video Prediction | Moving MNIST | ConvMixer (SimVP 10x) | MSE | 22.3 | #15 |
| | | | MAE | 67.37 | #14 |
| Video Prediction | Moving MNIST | Poolformer (SimVP 10x) | MSE | 20.96 | #14 |
| | | | MAE | 64.31 | #13 |
| Video Prediction | Moving MNIST | VAN (SimVP 10x) | MSE | 16.21 | #5 |
| | | | MAE | 53.57 | #5 |
| | | | SSIM | 0.9646 | #6 |
| Video Prediction | Moving MNIST | HorNet (SimVP 10x) | MSE | 17.4 | #6 |
| | | | MAE | 55.7 | #6 |
| | | | SSIM | 0.9624 | #7 |
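
One way to read the Moving MNIST rows is as a relative comparison between backbones under the same SimVP (10x) setup. The short sketch below copies the MSE values reported above, ranks the backbones, and computes how much lower the MogaNet MSE is relative to each alternative. The helper name `relative_reduction` is hypothetical, and a lower MSE is assumed to be better.

```python
# MSE on Moving MNIST for SimVP (10x) backbones, copied from the results above.
mse = {
    "MogaNet": 15.67,
    "Uniformer": 18.01,
    "Swin": 19.11,
    "ConvNeXt": 17.58,
    "ViT": 19.74,
    "MLP-Mixer": 18.85,
    "ConvMixer": 22.3,
    "Poolformer": 20.96,
    "VAN": 16.21,
    "HorNet": 17.4,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` error is lower than `baseline` error."""
    return (baseline - candidate) / baseline


# Backbones ordered from lowest (best) to highest MSE.
ranking = sorted(mse, key=mse.get)

# Relative MSE reduction of MogaNet against every other backbone.
reductions = {
    name: relative_reduction(value, mse["MogaNet"])
    for name, value in mse.items()
    if name != "MogaNet"
}

if __name__ == "__main__":
    for name in ranking:
        if name == "MogaNet":
            print(f"{name:<11} MSE {mse[name]:>6.2f}")
        else:
            print(f"{name:<11} MSE {mse[name]:>6.2f}  "
                  f"MogaNet is {reductions[name]:.1%} lower")
```

The same pattern applies to the MAE column; SSIM would need the sign flipped, since higher is better there.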

Methods