PA3D: Pose-Action 3D Machine for Video Recognition

CVPR 2019  ·  An Yan, Yali Wang, Zhifeng Li, Yu Qiao

Recent studies have witnessed the success of 3D CNNs for video action recognition. However, most 3D models are built upon RGB and optical flow streams, which may not fully exploit pose dynamics, an important cue for modeling human actions. To fill this gap, we propose a concise Pose-Action 3D Machine (PA3D), which can effectively encode multiple pose modalities within a unified 3D framework and consequently learn spatio-temporal pose representations for action recognition. More specifically, we introduce a novel temporal pose convolution to aggregate spatial poses over frames. Unlike classical temporal convolution, our operation can explicitly learn the pose motions that are discriminative for recognizing human actions. Extensive experiments on three popular benchmarks (JHMDB, HMDB, and Charades) show that PA3D outperforms recent pose-based approaches. Furthermore, PA3D is highly complementary to recent 3D CNNs such as I3D. Multi-stream fusion achieves state-of-the-art performance on all evaluated data sets.
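The abstract does not spell out how the temporal pose convolution is computed, so the following is only a minimal sketch of one plausible reading: each joint's heatmaps are stacked over T frames, and a small set of learned temporal filters mixes the frames at every pixel. The function name `temporal_pose_conv`, the tensor layout `(T, J, H, W)`, and the choice to share filters across joints are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def temporal_pose_conv(heatmaps, weights):
    """Mix pose heatmaps across frames with learned temporal filters.

    heatmaps: array of shape (T, J, H, W), one heatmap per frame and joint
              (assumed layout).
    weights:  array of shape (N, T), N temporal filters shared by all joints
              (assumed design).
    Returns an array of shape (J, N, H, W): N motion maps per joint.
    """
    num_frames = heatmaps.shape[0]
    if weights.shape[1] != num_frames:
        raise ValueError("each filter needs one weight per frame")
    # For every joint j and pixel (h, w), take a weighted sum over frames t.
    return np.einsum("nt,tjhw->jnhw", weights, heatmaps)


# Hypothetical sizes: 8 frames, 15 joints, 16x16 heatmaps, 4 filters.
rng = np.random.default_rng(0)
heatmaps = rng.random((8, 15, 16, 16))
weights = rng.standard_normal((4, 8))
motion_maps = temporal_pose_conv(heatmaps, weights)
print(motion_maps.shape)
```

Under this reading, a filter with weights such as (-1, +1, 0, ...) would reduce to a frame difference of a joint's heatmaps, which is one way a learned filter could capture pose motion rather than static pose.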

Task                               Dataset   Model                         Metric Name          Metric Value  Global Rank
Action Classification              Charades  PA3D + (GCN + I3D + NL I3D)   MAP                  41            #31
Skeleton Based Action Recognition  J-HMDB    PA3D                          Accuracy (RGB+pose)  69.5          #8
Skeleton Based Action Recognition  J-HMDB    PA3D+RPAN                     Accuracy (RGB+pose)  86.1          #2

Methods