Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation

We present a lightweight solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera viewpoints. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections, which can then be lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. To perform this lifting efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): our method outperforms or performs comparably to the state-of-the-art volumetric methods while, unlike them, running in real time.
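To make the DLT step concrete, the following is a minimal NumPy sketch of standard SVD-based DLT triangulation, i.e. the baseline the paper's GPU-friendly solver accelerates. The function name `triangulate_dlt`, the array shapes, and the toy cameras are illustrative assumptions, not the authors' code; the paper's contribution is a faster way of solving the same null-space problem.

```python
# Standard DLT triangulation via SVD (baseline; illustrative, not the paper's code).
import numpy as np

def triangulate_dlt(points_2d, projections):
    """Triangulate one 3D point from its 2D detections in multiple views.

    points_2d   : (V, 2) array of pixel coordinates, one row per view.
    projections : (V, 3, 4) array of camera projection matrices.
    Returns the (3,) point minimizing the algebraic DLT error.
    """
    # Each view contributes two rows to the homogeneous system A x = 0:
    #   u * P[2] - P[0] = 0  and  v * P[2] - P[1] = 0.
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)  # shape (2V, 4)

    # The solution is the right singular vector of A with the smallest
    # singular value; the paper replaces this full SVD with an iterative,
    # GPU-friendly solver for the same null-space problem.
    _, _, Vt = np.linalg.svd(A)
    x = Vt[-1]
    return x[:3] / x[3]  # de-homogenize

# Example: two synthetic views of the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])           # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1, 0, 0]]).T])   # shifted 1 unit along x
X = np.array([0.0, 0.0, 5.0, 1.0])
uv = lambda P: (P @ X)[:2] / (P @ X)[2]                  # project to pixels
print(triangulate_dlt(np.stack([uv(P1), uv(P2)]), np.stack([P1, P2])))
# -> approximately [0. 0. 5.]
```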

Published at CVPR 2020.
Benchmark results

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| 3D Human Pose Estimation | Human3.6M | LWCDR (extra training data) | Average MPJPE (mm) | 21.0 | #11 |
| | | | Using 2D ground-truth joints | No | #2 |
| | | | Multi-View or Monocular | Multi-View | #1 |
| 3D Human Pose Estimation | Human3.6M | LWCDR | Average MPJPE (mm) | 30.2 | #33 |
| | | | Using 2D ground-truth joints | No | #2 |
| | | | Multi-View or Monocular | Multi-View | #1 |
| 3D Human Pose Estimation | Total Capture | LWCDR | Average MPJPE (mm) | 27.5 | #4 |
