HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation

Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and higher-resolution outputs upsampled through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium persons on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes. The code and models are available at https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation.
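To illustrate how a stride-2 transposed convolution doubles the spatial resolution of a feature map (the operation HigherHRNet uses to build the higher-resolution level of its feature pyramid), here is a minimal single-channel sketch in NumPy. The kernel values, kernel size (4x4), stride (2), and padding (1) are illustrative assumptions chosen so the output is exactly twice the input resolution; the actual model applies this to multi-channel HRNet features with learned weights.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2, padding=1):
    """Naive single-channel 2-D transposed convolution.

    Each input pixel 'stamps' a scaled copy of the kernel onto the
    output grid at stride-spaced positions; overlaps are summed.
    Output size: (H - 1) * stride - 2 * padding + kernel_size.
    With a 4x4 kernel, stride 2, padding 1, an HxW map becomes 2Hx2W.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    out_h = (h - 1) * stride - 2 * padding + kh
    out_w = (w - 1) * stride - 2 * padding + kw
    # Work on a padded canvas, then crop the padding off at the end.
    full = np.zeros((out_h + 2 * padding, out_w + 2 * padding))
    for i in range(h):
        for j in range(w):
            full[i * stride:i * stride + kh,
                 j * stride:j * stride + kw] += x[i, j] * kernel
    return full[padding:padding + out_h, padding:padding + out_w]

# A coarse 4x4 "heatmap" is upsampled to 8x8 (twice the resolution).
coarse = np.random.rand(4, 4)
kernel = np.full((4, 4), 0.25)  # illustrative fixed weights
fine = transposed_conv2d(coarse, kernel)
print(fine.shape)  # → (8, 8)
```

Unlike bilinear upsampling, the transposed convolution has learnable weights, so the network can learn how to fill in the higher-resolution detail rather than interpolating it.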


Results from the Paper

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Multi-Person Pose Estimation | COCO test-dev | HigherHRNet (HR-Net-48) | AP | 70.5 | # 6 |
| Multi-Person Pose Estimation | COCO test-dev | HigherHRNet (HR-Net-48) | APL | 75.8 | # 4 |
| Multi-Person Pose Estimation | COCO test-dev | HigherHRNet (HR-Net-48) | APM | 66.6 | # 4 |
| Multi-Person Pose Estimation | COCO test-dev | HigherHRNet (HR-Net-48) | AP50 | 89.3 | # 4 |
| Multi-Person Pose Estimation | COCO test-dev | HigherHRNet (HR-Net-48) | AP75 | 77.2 | # 4 |
| Multi-Person Pose Estimation | CrowdPose | HigherHRNet (HR-Net-48) | mAP @0.5:0.95 | 67.6 | # 11 |
| Multi-Person Pose Estimation | CrowdPose | HigherHRNet (HR-Net-48) | AP Easy | 75.8 | # 9 |
| Multi-Person Pose Estimation | CrowdPose | HigherHRNet (HR-Net-48) | AP Medium | 68.1 | # 10 |
| Multi-Person Pose Estimation | CrowdPose | HigherHRNet (HR-Net-48) | AP Hard | 58.9 | # 9 |
| Multi-Person Pose Estimation | CrowdPose | HigherHRNet (HR-Net-48) | FPS | - | # 3 |
| Pose Estimation | UAV-Human | HigherHRNet | mAP | 56.5 | # 2 |