Lifting Transformer for 3D Human Pose Estimation in Video

26 Mar 2021  ·  Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang

Despite great progress in video-based 3D human pose estimation, it is still challenging to learn a discriminative single-pose representation from redundant sequences. To this end, we propose a novel Transformer-based architecture, called Lifting Transformer, for 3D human pose estimation to lift a sequence of 2D joint locations to a 3D pose... Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from local context, the fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions that progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE), and it is built upon the outputs of VTE. STE not only significantly reduces the computation cost but also effectively aggregates information into a single-vector representation in a global and local fashion. Moreover, a full-to-single supervision scheme is employed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision. The proposed architecture is evaluated on two challenging benchmark datasets, namely Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
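To make the sequence-reduction idea concrete, here is a minimal sketch in plain Python of how strided convolutions can shrink a feature sequence to a single vector, and how a full-sequence term and a single-frame term might be combined into one loss. Everything specific here is an assumption for illustration and is not taken from the abstract: the kernel size of 3, the stride schedule of five stride-3 layers for T=243, the averaging taps standing in for learned weights, and the helper names `strided_conv1d`, `reduce_sequence`, and `full_to_single_loss` with its weight `lam_full`.

```python
def strided_conv1d(frames, weights, stride):
    """Depthwise 1D convolution over time, without padding.

    frames:  list of T feature vectors (lists of floats).
    weights: list of k scalar taps shared by every channel (hypothetical).
    Returns floor((T - k) / stride) + 1 output vectors.
    """
    k = len(weights)
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        out.append([sum(w * f[c] for w, f in zip(weights, window))
                    for c in range(dim)])
    return out


def reduce_sequence(frames, strides, kernel_size=3):
    """Apply one strided convolution per entry in `strides`.

    Returns the reduced sequence and the length after each layer.
    """
    # Placeholder averaging taps; a real model would learn these.
    weights = [1.0 / kernel_size] * kernel_size
    lengths = [len(frames)]
    for s in strides:
        frames = strided_conv1d(frames, weights, s)
        lengths.append(len(frames))
    return frames, lengths


def full_to_single_loss(full_errors, single_error, lam_full=1.0):
    """Combine per-frame errors of the full-sequence output with the
    error of the single target-frame output (weighting is assumed)."""
    full_term = sum(full_errors) / len(full_errors)
    return lam_full * full_term + single_error


if __name__ == "__main__":
    # 243 frames with 2 feature channels each.
    frames = [[float(t), 1.0] for t in range(243)]
    reduced, lengths = reduce_sequence(frames, strides=[3, 3, 3, 3, 3])
    print(lengths)        # sequence length after each strided layer
    print(len(reduced))   # number of vectors left at the end
    print(full_to_single_loss([0.5, 0.5], 0.5))
```

With kernel size and stride both equal to 3, each layer divides the sequence length by three, so 243 frames collapse to one vector after five layers. Other stride schedules would reach a single vector in fewer or more steps.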

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Human Pose Estimation | Human3.6M | Lifting Transformer (T=243 CPN) | Average MPJPE (mm) | 29.1 | # 5 |
| | | | Using 2D ground-truth joints | yes | # 1 |
| 3D Human Pose Estimation | Human3.6M | Lifting Transformer (T=243 CPN, protocol 1) | Average MPJPE (mm) | 44.7 | # 18 |
| 3D Human Pose Estimation | Human3.6M | Lifting Transformer (T=243 CPN, protocol 2) | Average MPJPE (mm) | 36.1 | # 8 |
| 3D Human Pose Estimation | HumanEva-I | Lifting Transformer (T=27 GT) | Mean Reconstruction Error (mm) | 12.2 | # 1 |
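
For readers unfamiliar with the metric in the table, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single pose in plain Python. The function name `mpjpe` and the toy coordinates are illustrative assumptions, and the rigid alignment step commonly applied under protocol 2 is not included.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching joints.

    pred, gt: lists of (x, y, z) joint positions in the same unit
    (millimetres in the table above).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


if __name__ == "__main__":
    # Toy two-joint pose: one joint is exact, the other is 5 mm off.
    pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(mpjpe(pred, gt))  # 2.5
```

A dataset-level score would average this value over all evaluated frames.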