MotionBERT: A Unified Perspective on Learning Human Motion Representations

We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as exemplified by the lowest 3D pose estimation error reported so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/
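The pretraining input described above ("noisy partial 2D observations") can be illustrated with a minimal sketch. It assumes the 2D input is obtained by dropping the depth coordinate of each 3D joint, then masking some joints and perturbing the rest with Gaussian noise. The function name `corrupt_2d`, the mask probability, and the noise scale are hypothetical choices for illustration and are not taken from the paper or its code.

```python
import random

def corrupt_2d(motion_3d, mask_prob=0.15, noise_std=0.01, seed=0):
    """Turn a clean 3D motion into a noisy, partial 2D observation.

    motion_3d: list of frames, each a list of (x, y, z) joint positions.
    Returns a list of frames, each a list of (x, y, visible) tuples,
    where visible is 1.0 for kept joints and 0.0 for masked joints.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    corrupted = []
    for frame in motion_3d:
        out = []
        for (x, y, _z) in frame:  # drop depth: a simple orthographic projection
            if rng.random() < mask_prob:
                out.append((0.0, 0.0, 0.0))  # masked joint: "partial" observation
            else:
                out.append((x + rng.gauss(0.0, noise_std),
                            y + rng.gauss(0.0, noise_std),
                            1.0))            # jittered joint: "noisy" observation
        corrupted.append(out)
    return corrupted

# Three frames of a two-joint toy skeleton.
motion = [[(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)] for _ in range(3)]
observed = corrupt_2d(motion)
```

An encoder pretrained in this setting would take `observed` as input and be supervised with the original `motion` as the 3D target.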

ICCV 2023

Results from the Paper


 Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (using extra training data)

Task                                Dataset        Model                  Metric                        Value      Global Rank
3D Human Pose Estimation            3DPW           MotionBERT (Finetune)  PA-MPJPE                      47.2       #46
                                                                          MPJPE                         76.9       #49
                                                                          MPVPE                         88.1       #37
3D Human Pose Estimation            3DPW           MotionBERT-HybrIK      PA-MPJPE                      40.6       #14
                                                                          MPJPE                         68.8       #19
                                                                          MPVPE                         79.4       #14
Monocular 3D Human Pose Estimation  Human3.6M      MotionBERT (Scratch)   Average MPJPE (mm)            39.2       #5
                                                                          Use Video Sequence            Yes        #1
                                                                          Frames Needed                 243        #33
                                                                          Need Ground Truth 2D Pose     No         #1
                                                                          2D detector                   SH         #1
3D Human Pose Estimation            Human3.6M      MotionBERT (Finetune)  Average MPJPE (mm)            16.9       #2
                                                                          Using 2D ground-truth joints  Yes        #2
                                                                          Multi-View or Monocular       Monocular  #1
Monocular 3D Human Pose Estimation  Human3.6M      MotionBERT (Finetune)  Average MPJPE (mm)            37.5       #2
                                                                          Use Video Sequence            Yes        #1
                                                                          Frames Needed                 243        #33
                                                                          Need Ground Truth 2D Pose     No         #1
                                                                          2D detector                   SH         #1
Skeleton Based Action Recognition   NTU RGB+D      MotionBERT (Finetune)  Accuracy (CV)                 97.2       #9
                                                                          Accuracy (CS)                 93.0       #11
One-Shot 3D Action Recognition      NTU RGB+D 120  MotionBERT (Finetune)  Accuracy                      67.4%      #1
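
For readers unfamiliar with the MPJPE metric reported above, the following is a minimal sketch of how mean per-joint position error is commonly computed: the Euclidean distance between each predicted joint and its ground-truth counterpart, averaged over all joints and frames. The function name `mpjpe` and the toy data are illustrative assumptions and are not taken from the paper's code. PA-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth (Procrustes alignment), which this sketch omits.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: lists of frames, each a list of (x, y, z) joints,
    with matching shapes. The result is in the same unit as the inputs
    (millimetres if the joints are given in millimetres).
    """
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    return total / count

# One frame, two joints: errors of 5.0 and 0.0 average to 2.5.
pred = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
target = [[(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]]
print(mpjpe(pred, target))  # prints 2.5
```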

Methods