A generic diffusion-based approach for 3D human pose prediction in the wild

Predicting 3D human poses in real-world scenarios, also known as human pose forecasting, is inevitably subject to noisy inputs arising from inaccurate 3D pose estimations and occlusions. To address these challenges, we propose a diffusion-based approach that can predict future poses from noisy observations. We frame the prediction task as a denoising problem, where the observation and the prediction are treated as a single sequence containing missing elements (whether in the observation or in the prediction horizon). All missing elements are treated as noise and denoised with our conditional diffusion model. To better handle long forecasting horizons, we present a temporal cascaded diffusion (TCD) model. We demonstrate the benefits of our approach on four publicly available datasets (Human3.6M, HumanEva-I, AMASS, and 3DPW), where it outperforms the state of the art. Additionally, we show that our framework is generic enough to improve any 3D pose prediction model, serving as a pre-processing step to repair its inputs and as a post-processing step to refine its outputs. The code is available online: https://github.com/vita-epfl/DePOSit.
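The denoising formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, array shapes, and the `denoise_step` callable (a stand-in for one reverse step of a trained conditional diffusion model) are assumptions made for illustration, and clamping observed joints after every step is a simplification of how conditioning might be enforced.

```python
import numpy as np


def build_masked_sequence(observed, obs_mask, horizon, rng):
    """Join the observation and an empty prediction horizon into one sequence.

    observed: (T_obs, J, 3) array of 3D joints (hypothetical layout).
    obs_mask: (T_obs, J) boolean array, True where a joint is reliably observed.
    horizon:  number of future frames to predict.
    Returns (sequence, mask) with shapes (T_obs + horizon, J, 3) and
    (T_obs + horizon, J).
    """
    t_obs, n_joints, _ = observed.shape
    # Future frames are entirely missing, so their mask is all False.
    future_mask = np.zeros((horizon, n_joints), dtype=bool)
    mask = np.concatenate([obs_mask, future_mask], axis=0)
    # Start every position from Gaussian noise, then copy in the known joints.
    sequence = rng.standard_normal((t_obs + horizon, n_joints, 3))
    sequence[:t_obs][obs_mask] = observed[obs_mask]
    return sequence, mask


def denoise(sequence, mask, denoise_step, num_steps):
    """Iteratively denoise the missing entries while keeping observed ones fixed.

    denoise_step(x, mask, step) is an assumed callable standing in for one
    reverse step of a trained model.
    """
    x = sequence.copy()
    for step in reversed(range(num_steps)):
        proposal = denoise_step(x, mask, step)
        # Only missing entries are updated; observed joints are restored.
        x = np.where(mask[..., None], sequence, proposal)
    return x
```

With this framing, an occluded joint in the observation and an unseen future frame are handled the same way: both are masked-out positions that start as noise and are filled in by the model.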


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Human Pose Forecasting | 3DPW | TCD | FDE@560ms (mm) | 55.4 | # 1 |
| | | | FDE@720ms (mm) | 61.6 | # 1 |
| | | | FDE@880ms (mm) | 67.9 | # 1 |
| | | | FDE@1000ms (mm) | 73.4 | # 1 |
| Human Pose Forecasting | AMASS | TCD | FDE@560ms (mm) | 49.8 | # 1 |
| | | | FDE@720ms (mm) | 54.5 | # 1 |
| | | | FDE@880ms (mm) | 60.1 | # 1 |
| | | | FDE@1000ms (mm) | 66.7 | # 1 |
| Human Pose Forecasting | Human3.6M | TCD | APD | 19466 | # 1 |
| | | | ADE | 356 | # 1 |
| | | | FDE | 396 | # 1 |
| | | | MMADE | 463 | # 2 |
| | | | MMFDE | 445 | # 1 |
| Human Pose Forecasting | HumanEva-I | TCD | APD@2000ms | 6764 | # 1 |
| | | | ADE@2000ms | 199 | # 1 |
| | | | FDE@2000ms | 215 | # 1 |
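For readers unfamiliar with the displacement metrics in the table, the sketch below shows one common way to compute ADE and FDE for a single predicted sequence. The per-joint averaging and the `(T, J, 3)` array layout are assumptions; benchmarks differ in their exact conventions (some measure the distance over the whole flattened pose vector instead), and the diversity and multi-modal metrics (APD, MMADE, MMFDE) are not covered here.

```python
import numpy as np


def displacement_errors(pred, target):
    """Return (ADE, FDE) for one sequence, in the units of the inputs.

    pred, target: (T, J, 3) arrays of predicted and ground-truth 3D joints
    (hypothetical layout). The error of a frame is taken as the mean
    Euclidean distance over its joints.
    """
    # Per-joint Euclidean distance, averaged over joints: shape (T,).
    per_frame = np.linalg.norm(pred - target, axis=-1).mean(axis=-1)
    ade = per_frame.mean()   # average over all predicted frames
    fde = per_frame[-1]      # error at the final predicted frame only
    return ade, fde
```

Under this reading, an entry such as FDE@560ms would correspond to calling the function on a prediction truncated at the frame 560 ms into the horizon.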

Methods