Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks

Despite great progress in 3D pose estimation from single-view images or videos, it remains a challenging task due to substantial depth ambiguity and severe self-occlusion. Motivated by the effectiveness of incorporating spatial dependencies and temporal consistency to alleviate these issues, we propose a novel graph-based method for 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. In particular, domain knowledge about hand and body configurations is explicitly incorporated into the graph convolutional operations to meet the specific demands of 3D pose estimation. Furthermore, we introduce a local-to-global network architecture capable of learning multi-scale features for the graph-based representations. We evaluate the proposed method on challenging benchmark datasets for both 3D hand pose estimation and 3D body pose estimation. Experimental results show that our method achieves state-of-the-art performance on both tasks.
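The core building block described in the abstract, a graph convolution over the skeleton of 2D joint detections, can be sketched roughly as follows. This is a minimal illustration only: the 5-joint chain, the edge list, and the plain symmetrically-normalized GCN layer are assumptions for the example, not the paper's actual hand/body graph structure or its domain-specific weight-sharing scheme.

```python
import numpy as np

# Hypothetical skeleton for illustration: 5 joints in a chain
# (the paper's actual body/hand graphs differ).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
num_joints, in_dim, out_dim = 5, 2, 4

# Adjacency with self-loops, symmetrically normalized:
# A_norm = D^{-1/2} (A + I) D^{-1/2}
A = np.eye(num_joints)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))

rng = np.random.default_rng(0)
W = rng.standard_normal((in_dim, out_dim)) * 0.1  # learnable weights

def graph_conv(X, A_norm, W):
    """One vanilla GCN layer: aggregate features over the skeleton
    graph, project them, and apply a ReLU nonlinearity."""
    return np.maximum(A_norm @ X @ W, 0.0)

# X: 2D joint detections for a single frame, shape (num_joints, 2)
X = rng.standard_normal((num_joints, in_dim))
H = graph_conv(X, A_norm, W)
print(H.shape)  # (5, 4): one out_dim-dimensional feature per joint
```

A temporal model in the spirit of the paper would additionally share or link such layers across the T frames of the input sequence; that part is omitted here.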


Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
3D Human Pose Estimation | Human3.6M | STRGCN (T=3) | Average MPJPE (mm) | 49.1 | #150
    Uses 2D ground-truth joints: No
    Multi-view or monocular: Monocular

Results from Other Papers


Task | Dataset | Model | Metric | Value | Rank
3D Human Pose Estimation | Human3.6M | STRGCN (T=7) | Average MPJPE (mm) | 48.8 | #146
    Uses 2D ground-truth joints: No
    Multi-view or monocular: Monocular
3D Human Pose Estimation | Human3.6M | STRGCN (T=1) | Average MPJPE (mm) | 50.6 | #172
    Uses 2D ground-truth joints: No
    Multi-view or monocular: Monocular
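The MPJPE metric reported in the tables above (Mean Per-Joint Position Error) is the Euclidean distance between predicted and ground-truth 3D joint positions, averaged over joints and frames, in millimeters. A minimal sketch, assuming a 17-joint Human3.6M-style skeleton:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance
    between predicted and ground-truth 3D joints, in mm.
    pred, gt: arrays of shape (..., num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))                    # 17-joint skeleton, toy values
pred = gt + np.array([30.0, 40.0, 0.0])   # constant 50 mm offset per joint
print(mpjpe(pred, gt))  # 50.0
```

Leaderboards typically report this after aligning the predicted root joint to the ground truth; protocol variants (e.g. Procrustes-aligned P-MPJPE) differ only in the alignment step.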
