VIBE: Video Inference for Human Body Pose and Shape Estimation

Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at
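The sequence-level adversarial objective described above can be sketched with the least-squares GAN losses commonly used by HMR-style adversarial priors; this is a minimal illustration, assuming the motion discriminator outputs one real/fake score per sequence (the exact loss form used by VIBE is an assumption here, not taken from the abstract):

```python
def discriminator_loss(real_scores, fake_scores):
    """LSGAN-style discriminator loss: push scores of real AMASS motions
    toward 1 and scores of regressed (fake) motions toward 0.
    Scores are plain floats, one per motion sequence (illustrative assumption)."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_adv_loss(fake_scores):
    """Adversarial term for the temporal regressor: make generated motion
    sequences score like real ones (i.e., close to 1)."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
```

In training, this adversarial term would be combined with 2D keypoint reprojection losses on in-the-wild video, so the regressor is supervised without ground-truth 3D labels.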

CVPR 2020
Benchmark results (metric, value, global rank):

3D Human Pose Estimation on 3DPW (VIBE):
  PA-MPJPE: 51.9 (#19)
  MPJPE: 82.9 (#18)
  MPVPE: 99.1 (#12)
  Acceleration Error: 23.4 (#11)
  Number of parameters (M): 72.43 (#3)

3D Human Pose Estimation on Human3.6M (VIBE):
  Average MPJPE (mm): 65.6 (#212)
  Using 2D ground-truth joints: No (#1)
  Multi-View or Monocular: Monocular (#1)
  PA-MPJPE: 41.4 (#33)

Monocular 3D Human Pose Estimation on Human3.6M (VIBE):
  Average MPJPE (mm): 65.6 (#20)
  Use Video Sequence: Yes (#1)
  Frames Needed: 16 (#23)
  Need Ground Truth 2D Pose: No (#1)

3D Human Pose Estimation on MPI-INF-3DHP (VIBE):
  MPJPE: 96.6 (#30)
  PA-MPJPE: 64.6 (#9)
  PCK: 89.3 (#10)
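For reference, two of the metrics reported above can be sketched in plain Python, with joints as (x, y, z) tuples and a sequence as a list of per-frame joint lists. MPJPE is the mean Euclidean joint distance; the second-finite-difference form of acceleration error is an assumption based on common usage, not a definition taken from this page:

```python
def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints (same units as input)."""
    dists = [((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2) ** 0.5
             for (px, py, pz), (gx, gy, gz) in zip(pred, gt)]
    return sum(dists) / len(dists)

def accel_error(pred_seq, gt_seq):
    """Mean difference between predicted and ground-truth joint accelerations,
    with acceleration taken as the second finite difference over consecutive
    frames (assumed definition)."""
    def accels(seq):
        return [[(x2 - 2 * x1 + x0, y2 - 2 * y1 + y0, z2 - 2 * z1 + z0)
                 for (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) in zip(f0, f1, f2)]
                for f0, f1, f2 in zip(seq, seq[1:], seq[2:])]
    per_frame = [mpjpe(pa, ga) for pa, ga in zip(accels(pred_seq), accels(gt_seq))]
    return sum(per_frame) / len(per_frame)
```

Acceleration error is the metric that rewards temporally smooth, kinematically plausible motion, which is what the sequence-level adversarial training targets.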