MogaNet: Multi-order Gated Aggregation Network

By contextualizing the kernel as global as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
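
The abstract names the module's two ingredients, convolutions and gated aggregation, without giving details. The following is a minimal single-channel, 1D sketch of that general idea in plain Python, not the authors' implementation: context is gathered by summing a few dilated convolutions, then multiplied element-wise by a gate computed from the same input. The function names, the kernel weights, the dilations, the way the branches are combined, and the choice of SiLU as activation are all illustrative assumptions; the official code lives in the repository linked in the abstract.

```python
import math


def silu(x):
    """SiLU activation x * sigmoid(x), written with tanh to avoid overflow."""
    return x * 0.5 * (1.0 + math.tanh(x / 2.0))


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel."""
    n = len(signal)
    half = (len(kernel) // 2) * dilation  # offset that centres the kernel
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def gated_aggregation(signal, kernels, dilations, gate_weight=1.0):
    """Hypothetical gated aggregation: gate(x) * context(x), element-wise.

    context(x): sum of several dilated convolutions (assumed combination).
    gate(x):    SiLU of a pointwise scaling of the input (assumed form).
    """
    context = [0.0] * len(signal)
    for kernel, dilation in zip(kernels, dilations):
        for i, v in enumerate(dilated_conv1d(signal, kernel, dilation)):
            context[i] += v
    gate = [silu(gate_weight * x) for x in signal]
    return [g * silu(c) for g, c in zip(gate, context)]


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
    # Three averaging kernels with growing dilation (illustrative values only).
    kernels = [[1 / 3] * 3, [1 / 3] * 3, [1 / 5] * 5]
    dilations = [1, 2, 3]
    print(gated_aggregation(x, kernels, dilations))
```

The gate lets each position scale how much of the aggregated context it passes on, which is the sense in which features are "contextualized adaptively" in this sketch.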

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Semantic Segmentation | ADE20K | MogaNet-L (UperNet) | Validation mIoU | 50.9 | # 100 |
| | | | GFLOPs (512 x 512) | 1176 | # 21 |
| Semantic Segmentation | ADE20K | MogaNet-S (Semantic FPN) | Validation mIoU | 47.7 | # 150 |
| | | | GFLOPs (512 x 512) | 189 | # 8 |
| Semantic Segmentation | ADE20K | MogaNet-S (UperNet) | Validation mIoU | 49.2 | # 128 |
| | | | GFLOPs (512 x 512) | 946 | # 13 |
| Semantic Segmentation | ADE20K | MogaNet-B (UperNet) | Validation mIoU | 50.1 | # 113 |
| | | | GFLOPs (512 x 512) | 1050 | # 17 |
| Semantic Segmentation | ADE20K | MogaNet-XL (UperNet) | Validation mIoU | 54 | # 63 |
| Object Detection | COCO 2017 val | MogaNet-XL (Cascade Mask R-CNN) | AP | 56.2 | # 4 |
| Object Detection | COCO 2017 val | MogaNet-T (Mask R-CNN 1x) | AP | 42.6 | # 21 |
| Object Detection | COCO 2017 val | MogaNet-XT (Mask R-CNN 1x) | AP | 40.7 | # 23 |
| Object Detection | COCO 2017 val | MogaNet-T (RetinaNet 1x) | AP | 41.4 | # 22 |
| Object Detection | COCO 2017 val | MogaNet-XT (RetinaNet 1x) | AP | 39.7 | # 25 |
| Object Detection | COCO 2017 val | MogaNet-L (Cascade Mask R-CNN) | AP | 53.3 | # 5 |
| Object Detection | COCO 2017 val | MogaNet-B (Cascade Mask R-CNN) | AP | 52.6 | # 6 |
| Object Detection | COCO 2017 val | MogaNet-S (Cascade Mask R-CNN) | AP | 51.6 | # 7 |
| Object Detection | COCO 2017 val | MogaNet-L (Mask R-CNN 1x) | AP | 49.4 | # 11 |
| Object Detection | COCO 2017 val | MogaNet-B (Mask R-CNN 1x) | AP | 47.9 | # 15 |
| Object Detection | COCO 2017 val | MogaNet-S (Mask R-CNN 1x) | AP | 46.7 | # 18 |
| Object Detection | COCO 2017 val | MogaNet-L (RetinaNet 1x) | AP | 48.7 | # 14 |
| Object Detection | COCO 2017 val | MogaNet-B (RetinaNet 1x) | AP | 47.7 | # 16 |
| Object Detection | COCO 2017 val | MogaNet-S (RetinaNet 1x) | AP | 45.8 | # 19 |
| Instance Segmentation | COCO test-dev | MogaNet-B (Mask R-CNN 1x) | mask AP | 43.2 | # 48 |
| Instance Segmentation | COCO test-dev | MogaNet-XL (Cascade Mask R-CNN) | mask AP | 48.8 | # 24 |
| Instance Segmentation | COCO test-dev | MogaNet-L (Cascade Mask R-CNN) | mask AP | 46.1 | # 36 |
| Instance Segmentation | COCO test-dev | MogaNet-B (Cascade Mask R-CNN) | mask AP | 46 | # 38 |
| Instance Segmentation | COCO test-dev | MogaNet-S (Cascade Mask R-CNN) | mask AP | 45.1 | # 41 |
| Instance Segmentation | COCO test-dev | MogaNet-L (Mask R-CNN 1x) | mask AP | 44.1 | # 45 |
| Instance Segmentation | COCO test-dev | MogaNet-S (Mask R-CNN 1x) | mask AP | 42.2 | # 52 |
| Instance Segmentation | COCO test-dev | MogaNet-T (Mask R-CNN 1x) | mask AP | 39.1 | # 80 |
| Instance Segmentation | COCO test-dev | MogaNet-XT | mask AP | 37.6 | # 89 |
| Instance Segmentation | COCO test-dev | MogaNet-T | mask AP | 35.8 | # 95 |
| Pose Estimation | COCO val2017 | MogaNet-T (256x192) | AP | 73.2 | # 4 |
| | | | AP50 | 90.1 | # 3 |
| | | | AP75 | 81 | # 3 |
| | | | AR | 78.8 | # 4 |
| Instance Segmentation | COCO val2017 | MogaNet-S (256x192) | AP50 | 90.7 | # 1 |
| | | | AP75 | 82.8 | # 1 |
| Pose Estimation | COCO val2017 | MogaNet-S (384x288) | AP | 76.4 | # 2 |
| | | | AP50 | 91 | # 2 |
| | | | AP75 | 83.3 | # 2 |
| | | | AR | 81.4 | # 2 |
| Pose Estimation | COCO val2017 | MogaNet-B (384x288) | AP | 77.3 | # 1 |
| | | | AP50 | 91.4 | # 1 |
| | | | AP75 | 84 | # 1 |
| | | | AR | 82.2 | # 1 |
| Pose Estimation | COCO val2017 | MogaNet-S (256x192) | AP | 74.9 | # 3 |
| | | | AR | 80.1 | # 3 |
| Image Classification | ImageNet | MogaNet-B | Top 1 Accuracy | 84.3% | # 305 |
| | | | Number of params | 44M | # 698 |
| | | | GFLOPs | 9.9 | # 296 |
| Image Classification | ImageNet | MogaNet-T (256res) | Top 1 Accuracy | 80% | # 664 |
| | | | Number of params | 5.2M | # 410 |
| | | | GFLOPs | 1.44 | # 131 |
| Image Classification | ImageNet | MogaNet-S | Top 1 Accuracy | 83.4% | # 394 |
| | | | Number of params | 25M | # 587 |
| | | | GFLOPs | 5 | # 231 |
| Image Classification | ImageNet | MogaNet-XL (384res) | Top 1 Accuracy | 87.8% | # 75 |
| | | | Number of params | 181M | # 885 |
| | | | GFLOPs | 102 | # 450 |
| Image Classification | ImageNet | MogaNet-XT (256res) | Top 1 Accuracy | 77.2% | # 813 |
| | | | Number of params | 3M | # 366 |
| | | | GFLOPs | 1.04 | # 106 |
| Image Classification | ImageNet | MogaNet-L | Top 1 Accuracy | 84.7% | # 281 |
| | | | Number of params | 83M | # 810 |
| | | | GFLOPs | 15.9 | # 345 |
| Video Prediction | Moving MNIST | Uniformer (SimVP 10x) | MSE | 18.01 | # 8 |
| | | | MAE | 57.52 | # 7 |
| Video Prediction | Moving MNIST | HorNet (SimVP 10x) | MSE | 17.4 | # 5 |
| | | | MAE | 55.7 | # 5 |
| | | | SSIM | 0.9624 | # 6 |
| Video Prediction | Moving MNIST | VAN (SimVP 10x) | MSE | 16.21 | # 4 |
| | | | MAE | 53.57 | # 4 |
| | | | SSIM | 0.9646 | # 5 |
| Video Prediction | Moving MNIST | Poolformer (SimVP 10x) | MSE | 20.96 | # 13 |
| | | | MAE | 64.31 | # 12 |
| Video Prediction | Moving MNIST | ConvMixer (SimVP 10x) | MSE | 22.3 | # 14 |
| | | | MAE | 67.37 | # 13 |
| Video Prediction | Moving MNIST | MLP-Mixer (SimVP 10x) | MSE | 18.85 | # 9 |
| | | | MAE | 59.86 | # 9 |
| Video Prediction | Moving MNIST | Swin (SimVP 10x) | MSE | 19.11 | # 10 |
| | | | MAE | 59.84 | # 8 |
| Video Prediction | Moving MNIST | ViT (SimVP 10x) | MSE | 19.74 | # 11 |
| | | | MAE | 61.65 | # 11 |
| | | | SSIM | 0.9539 | # 10 |
| Video Prediction | Moving MNIST | ConvNeXt (SimVP 10x) | MSE | 17.58 | # 6 |
| | | | MAE | 55.76 | # 6 |
| | | | SSIM | 0.9617 | # 8 |
| Video Prediction | Moving MNIST | MogaNet (SimVP 10x) | MSE | 15.67 | # 3 |
| | | | MAE | 51.84 | # 3 |
| | | | SSIM | 0.9661 | # 3 |
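
To make the complexity-performance trade-off in the ImageNet rows easier to read, the sketch below orders the six MogaNet variants by parameter count and reports what each step up in size costs and gains. The numbers are copied from the results above; the helper name `scaling_steps` and the record layout are hypothetical. Three variants are evaluated at non-default resolutions (256res or 384res), so the steps are only a rough comparison.

```python
# ImageNet rows copied from the results listing:
# (model, top-1 accuracy in %, parameters in millions, GFLOPs)
IMAGENET_ROWS = [
    ("MogaNet-B", 84.3, 44.0, 9.9),
    ("MogaNet-T (256res)", 80.0, 5.2, 1.44),
    ("MogaNet-S", 83.4, 25.0, 5.0),
    ("MogaNet-XL (384res)", 87.8, 181.0, 102.0),
    ("MogaNet-XT (256res)", 77.2, 3.0, 1.04),
    ("MogaNet-L", 84.7, 83.0, 15.9),
]


def scaling_steps(rows):
    """Sort variants by parameter count and, for each step to the next larger
    variant, report the top-1 gain and the extra parameters and GFLOPs."""
    ordered = sorted(rows, key=lambda r: r[2])
    steps = []
    for small, large in zip(ordered, ordered[1:]):
        steps.append({
            "from": small[0],
            "to": large[0],
            "top1_gain": round(large[1] - small[1], 2),
            "extra_params_m": round(large[2] - small[2], 2),
            "extra_gflops": round(large[3] - small[3], 2),
        })
    return steps


if __name__ == "__main__":
    for s in scaling_steps(IMAGENET_ROWS):
        print(
            f"{s['from']} -> {s['to']}: "
            f"+{s['top1_gain']} top-1, "
            f"+{s['extra_params_m']}M params, "
            f"+{s['extra_gflops']} GFLOPs"
        )
```

Read this way, the accuracy gained per added parameter shrinks as the models grow, which is the usual pattern for a scaled model family and is what the per-step figures let a reader check directly.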

Methods