ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

Human pose estimation has achieved significant progress in recent years. However, most recent methods focus on improving accuracy with complicated models while ignoring real-time efficiency. To achieve a better trade-off between accuracy and efficiency, we propose a novel neural architecture search (NAS) method, termed ViPNAS, to search networks at both the spatial and temporal levels for fast online video pose estimation. At the spatial level, we carefully design the search space with five different dimensions, including network depth, width, kernel size, group number, and attention modules. At the temporal level, we search over a series of temporal feature fusion strategies to optimize the overall accuracy and speed across multiple video frames. To the best of our knowledge, we are the first to search for temporal feature fusion and automatic computation allocation in videos. Extensive experiments demonstrate the effectiveness of our approach on the challenging COCO2017 and PoseTrack2018 datasets. Our discovered model families, S-ViPNAS and T-ViPNAS, achieve significantly higher inference speed (real-time on CPU) without sacrificing accuracy compared to previous state-of-the-art methods.
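The spatial search space described in the abstract can be pictured as per-stage choices over five dimensions: depth, width, kernel size, group number, and attention. The sketch below is a minimal illustration of that idea under assumed option values; it is not the paper's implementation, and all names (SPATIAL_SEARCH_SPACE, sample_architecture, the "se"/"cbam" attention options) are hypothetical.

```python
import random

# Hypothetical sketch of the five spatial search dimensions mentioned in the
# abstract (depth, width, kernel size, group number, attention). The concrete
# option values below are assumptions for illustration, not the paper's.
SPATIAL_SEARCH_SPACE = {
    "depth": [2, 3, 4],                  # number of blocks per stage
    "width": [32, 48, 64],               # channel width per stage
    "kernel_size": [3, 5, 7],            # convolution kernel size
    "groups": [1, 2, 4],                 # group-convolution factor
    "attention": [None, "se", "cbam"],   # optional attention module
}

def sample_architecture(num_stages=4, rng=random):
    """Sample one candidate spatial architecture: one choice per dimension per stage."""
    return [
        {dim: rng.choice(options) for dim, options in SPATIAL_SEARCH_SPACE.items()}
        for _ in range(num_stages)
    ]

if __name__ == "__main__":
    candidate = sample_architecture()
    for i, stage in enumerate(candidate):
        print(f"stage {i}: {stage}")
```

A NAS controller would score many such candidates for both pose accuracy and latency; the temporal level adds an analogous choice of how features are fused across video frames.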


Results from the Paper


Ranked #23 on Pose Estimation on COCO test-dev (using extra training data)

Task             Dataset        Model               Metric  Value  Global Rank
Pose Estimation  COCO test-dev  S-ViPNAS-HRNetW32   AP      73.9   #23
                                                    AP50    91.7   #20
                                                    AP75    82.0   #19
                                                    APL     79.5   #17
                                                    APM     70.5   #20
                                                    AR      80.4   #17
Pose Estimation  COCO test-dev  S-ViPNAS-Res50      AP      70.3   #31
                                                    AP50    90.7   #27
                                                    AP75    78.8   #26
                                                    APL     75.5   #27
                                                    APM     67.3   #25
                                                    AR      77.3   #24
