Monocular Expressive Body Regression through Body-Driven Attention

To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face- and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de .
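
The body-driven attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the low-resolution body image is a uniformly downscaled copy of the whole original image, and the function names (`part_crop_box`, `crop_part`) and the `pad` factor are hypothetical. It takes 2D keypoints of a hand or face predicted at low resolution, maps them back to original-image pixels, and cuts a square, padded crop at full resolution.

```python
import numpy as np


def part_crop_box(keypoints_lowres, scale_to_original, pad=1.5):
    """Square box (x0, y0, x1, y1) in original-image pixels around a body part.

    keypoints_lowres: (N, 2) array-like of (x, y) in the downscaled image.
    scale_to_original: factor mapping low-res pixels to original pixels
        (assumes uniform downscaling with no offset).
    pad: hypothetical enlargement factor that keeps some context around the part.
    """
    pts = np.asarray(keypoints_lowres, dtype=float) * scale_to_original
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    center = (lo + hi) / 2.0
    side = (hi - lo).max() * pad  # square side from the larger extent
    x0, y0 = center - side / 2.0
    x1, y1 = center + side / 2.0
    return x0, y0, x1, y1


def crop_part(image, box):
    """Cut the box out of an (H, W, C) image, clamped to the image bounds."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    x0, y0 = max(int(np.floor(x0)), 0), max(int(np.floor(y0)), 0)
    x1, y1 = min(int(np.ceil(x1)), w), min(int(np.ceil(y1)), h)
    return image[y0:y1, x0:x1]


# Usage: two hand keypoints found in a 4x-downscaled image.
image = np.zeros((1000, 800, 3), dtype=np.uint8)
hand_kps = [[20, 30], [24, 34]]
box = part_crop_box(hand_kps, scale_to_original=4.0)
hand_crop = crop_part(image, box)
```

The crop keeps the original pixel density, so a dedicated hand or face module would see more detail than the downscaled body image provides.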

ECCV 2020

Datasets


Introduced in the Paper:

ExPose

Used in the Paper:

FFHQ, 3DPW, FreiHAND, PIFu, AGORA

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Human Pose Estimation | 3DPW | ExPose | PA-MPJPE | 60.7 | #108 |
| | | | MPJPE | 93.4 | #102 |
| 3D Human Reconstruction | AGORA | ExPose | FB-NMVE | 265.0 | #3 |
| | | | B-NMVE | 184.8 | #2 |
| | | | FB-NMJE | 263.3 | #3 |
| | | | B-NMJE | 183.4 | #2 |
| | | | FB-MVE | 217.3 | #3 |
| | | | B-MVE | 151.5 | #2 |
| | | | F-MVE | 51.1 | #3 |
| | | | LH/RH-MVE | 74.9/71.3 | #1 |
| | | | FB-MPJPE | 215.9 | #3 |
| | | | B-MPJPE | 150.4 | #2 |
| | | | F-MPJPE | 55.2 | #3 |
| | | | LH/RH-MPJPE | 72.5/68.8 | #1 |
| 3D Multi-Person Mesh Recovery | AGORA | ExPose | FB-NMVE | 265.0 | #6 |
| | | | B-NMVE | 184.8 | #4 |
| | | | FB-NMJE | 263.3 | #2 |
| | | | B-NMJE | 183.4 | #2 |
| | | | FB-MVE | 217.3 | #5 |
| | | | B-MVE | 151.5 | #2 |
| | | | F-MVE | 51.1 | #5 |
| | | | LH/RH-MVE | 74.9/71.3 | #3 |
| | | | FB-MPJPE | 215.9 | #2 |
| | | | B-MPJPE | 150.4 | #2 |
| | | | F-MPJPE | 55.2 | #3 |
| | | | LH/RH-MPJPE | 72.5/68.8 | #1 |
| 3D Human Reconstruction | Expressive hands and faces dataset (EHF) | ExPose | PA V2V (mm), whole body | 54.5 | #2 |
| | | | PA V2V (mm), body only | 52.6 | #1 |
| | | | PA V2V (mm), left hand | 13.1 | #1 |
| | | | PA V2V (mm), face | 5.8 | #3 |
| | | | TR V2V (mm), whole body | 65.7 | #1 |
| | | | TR V2V (mm), body only | 76.8 | #2 |
| | | | TR V2V (mm), left hand | 31.2 | #3 |
| | | | TR V2V (mm), face | 15.9 | #1 |
| | | | MPJPE-14 | 62.8 | #3 |
| | | | MPJPE, left hand | 13.5 | #4 |
| | | | mean P2S | 28.9 | #1 |
| | | | median P2S | 18 | #1 |
| 3D Human Reconstruction | Expressive hands and faces dataset (EHF). | | PA-V2V (mm) All | 54.5 | #1 |
| 3D Hand Pose Estimation | FreiHAND | ExPose (hand sub-network h) | PA-MPVPE | 11.8 | #25 |
| | | | PA-MPJPE | 12.2 | #25 |
| | | | PA-F@5mm | 0.484 | #24 |
| | | | PA-F@15mm | 0.918 | #25 |
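
Several of the metrics above are variants of one computation. MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and the PA- prefix means the prediction is first aligned to the ground truth with a similarity transform (Procrustes alignment: scale, rotation, translation). The sketch below shows how such numbers are typically computed. It is a generic illustration using NumPy, not the evaluation code of any benchmark, and the function names are hypothetical.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error between (N, 3) arrays, in input units."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred to gt (least-squares scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    corr = np.array([1.0, 1.0, d])
    rot = u @ np.diag(corr) @ vt
    scale = (s * corr).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


# Usage: a prediction that differs from the ground truth only by a
# similarity transform has a large MPJPE but a near-zero PA-MPJPE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(14, 3))
theta = 0.3
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
pred = 1.2 * gt @ rot_z + np.array([0.1, -0.2, 0.3])
err_raw = mpjpe(pred, gt)
err_pa = pa_mpjpe(pred, gt)
```

Applying the same functions to mesh vertices instead of joints gives vertex-to-vertex (V2V) errors; with inputs in millimetres, the outputs are in millimetres as well.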

Methods