Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation

22 Oct 2021  ·  Ziwen Li, Bo Xu, Han Huang, Cheng Lu, Yandong Guo

Several video-based 3D pose and shape estimation algorithms have been proposed to resolve the temporal inconsistency of single-image-based methods. However, stable and accurate reconstruction remains challenging. In this paper, we propose a new framework, Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation (DTS-VIBE), to generate 3D human pose and mesh from RGB videos. We reformulate the task as a multi-modality problem that fuses RGB and optical flow for more reliable estimation. To fully utilize both sensory modalities (RGB and optical flow), we train a transformer-based two-stream temporal network to predict SMPL parameters. The supplementary modality, optical flow, helps to maintain temporal consistency by leveraging motion knowledge between two consecutive frames. The proposed algorithm is extensively evaluated on the Human3.6M and 3DPW datasets. The experimental results show that it outperforms other state-of-the-art methods by a significant margin.
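To make the two-stream input concrete, here is a minimal sketch of how per-frame RGB features might be aligned with optical-flow features before fusion. It is not the paper's method: the abstract describes a transformer-based temporal network, whereas this sketch substitutes plain feature concatenation and only illustrates the alignment problem (T frames yield T-1 flow fields). The function name `fuse_two_streams`, the list-of-lists feature layout, and the zero-filled flow slot for the first frame are all assumptions made for illustration.

```python
def fuse_two_streams(rgb_feats, flow_feats):
    """Concatenate per-frame RGB features with optical-flow features.

    rgb_feats:  list of T feature vectors, one per video frame.
    flow_feats: list of T-1 feature vectors, one per pair of
                consecutive frames (flow from frame t-1 to frame t).

    Assumption: the first frame has no preceding frame, so its
    flow slot is filled with zeros. Other padding choices are possible.
    """
    if len(flow_feats) != len(rgb_feats) - 1:
        raise ValueError("expected one flow vector per consecutive frame pair")

    flow_dim = len(flow_feats[0]) if flow_feats else 0
    # Pad the flow stream so it has one entry per frame.
    padded_flow = [[0.0] * flow_dim] + [list(f) for f in flow_feats]

    # Simple late fusion by concatenation (a stand-in for the
    # transformer-based fusion described in the abstract).
    return [list(r) + f for r, f in zip(rgb_feats, padded_flow)]


if __name__ == "__main__":
    rgb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 frames, 2-D features
    flow = [[0.1], [0.2]]                        # 2 frame pairs, 1-D features
    for fused in fuse_two_streams(rgb, flow):
        print(fused)
```

In a real pipeline the fused sequence would be passed to a temporal model that regresses SMPL parameters per frame; that stage is omitted here because the source does not specify it.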

Results from the Paper


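Task                      Dataset       Model     Metric Name          Metric Value  Global Rank
3D Human Pose Estimation  3DPW          DST-VIBE  PA-MPJPE             50.3          # 61
3D Human Pose Estimation  3DPW          DST-VIBE  MPJPE                76.7          # 53
3D Human Pose Estimation  3DPW          DST-VIBE  MPVPE                93.5          # 50
3D Human Pose Estimation  3DPW          DST-VIBE  Acceleration Error   11            # 13
3D Human Pose Estimation  Human3.6M     DST-VIBE  Average MPJPE (mm)   60.5          # 263
3D Human Pose Estimation  Human3.6M     DST-VIBE  PA-MPJPE             39.3          # 62
3D Human Pose Estimation  Human3.6M     DST-VIBE  Acceleration Error   5             # 8
3D Human Pose Estimation  MPI-INF-3DHP  DST-VIBE  MPJPE                93.4          # 53
3D Human Pose Estimation  MPI-INF-3DHP  DST-VIBE  PA-MPJPE             62.2          # 9
3D Human Pose Estimation  MPI-INF-3DHP  DST-VIBE  Acceleration Error   11.9          # 11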
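
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of two of them, MPJPE and acceleration error, under their commonly used definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and acceleration error is the mean distance between predicted and ground-truth joint accelerations, computed here by second finite differences. The function names and the nested-list data layout are assumptions for illustration; the exact evaluation protocol behind the reported numbers (joint set, root alignment, units, frame rate) is not given in the source. PA-MPJPE, which first applies a Procrustes alignment, is not shown.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for one frame.

    pred, gt: lists of (x, y, z) joint positions in the same units.
    """
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def joint_accelerations(seq):
    """Second finite difference x[t-1] - 2*x[t] + x[t+1] for every joint.

    seq: list of frames, each a list of (x, y, z) joints.
    Returns len(seq) - 2 frames of per-joint acceleration vectors.
    """
    accel = []
    for t in range(1, len(seq) - 1):
        frame = []
        for prev_j, cur_j, next_j in zip(seq[t - 1], seq[t], seq[t + 1]):
            frame.append(tuple(a - 2 * b + c
                               for a, b, c in zip(prev_j, cur_j, next_j)))
        accel.append(frame)
    return accel


def acceleration_error(pred_seq, gt_seq):
    """Mean distance between predicted and ground-truth accelerations."""
    pred_acc = joint_accelerations(pred_seq)
    gt_acc = joint_accelerations(gt_seq)
    errs = [math.dist(p, g)
            for pf, gf in zip(pred_acc, gt_acc)
            for p, g in zip(pf, gf)]
    return sum(errs) / len(errs)


if __name__ == "__main__":
    # One joint, predicted 3 units off in x and 4 units off in y.
    print(mpjpe([(3.0, 4.0, 0.0)], [(0.0, 0.0, 0.0)]))  # 5.0

    # Ground truth moves at constant velocity; the prediction jitters
    # in y on the middle frame.
    gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
    pred = [[(0.0, 0.0, 0.0)], [(1.0, 1.0, 0.0)], [(2.0, 0.0, 0.0)]]
    print(acceleration_error(pred, gt))  # 2.0
```

The second example shows why acceleration error is used as a temporal-consistency measure: a single-frame jitter barely changes the average position error but produces a large acceleration mismatch.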
