Bootstrapped Representation Learning for Skeleton-Based Action Recognition

4 Feb 2022  ·  Olivier Moliner, Sangxia Huang, Kalle Åström ·

In this work, we study self-supervised representation learning for 3D skeleton-based action recognition. We extend Bootstrap Your Own Latent (BYOL) to representation learning on skeleton sequence data and propose a new data augmentation strategy consisting of two asymmetric transformation pipelines. We also introduce a multi-viewpoint sampling method that leverages multiple viewing angles of the same action captured by different cameras. In the semi-supervised setting, we show that performance can be further improved by knowledge distillation from wider networks, once more leveraging the unlabeled samples. We conduct extensive experiments on the NTU-60 and NTU-120 datasets to demonstrate the performance of our proposed method. Our method consistently outperforms the current state of the art on both linear evaluation and semi-supervised benchmarks.
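To make the core idea concrete, the following is a minimal NumPy sketch of a BYOL-style objective with two asymmetric augmentation pipelines applied to a toy skeleton sequence. The specific transformations (rotation about the vertical axis, coordinate noise), the linear "encoders", and all parameter values are illustrative assumptions for this sketch, not the paper's actual architecture or augmentations.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(seq, max_angle=0.3):
    """Rotate all joints around the vertical (y) axis by a random angle.

    seq has shape (frames, joints, 3). Hypothetical augmentation choice.
    """
    a = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(a), np.sin(a)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    return seq @ R.T

def add_noise(seq, sigma=0.01):
    """Add Gaussian jitter to joint coordinates (hypothetical choice)."""
    return seq + rng.normal(0.0, sigma, seq.shape)

# Asymmetric pipelines: the two views do NOT receive the same set of
# transformations (one pipeline is stronger than the other).
def view1(seq):
    return random_rotation(seq)

def view2(seq):
    return add_noise(random_rotation(seq))

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def byol_loss(online_pred, target_proj):
    """BYOL regresses the online prediction onto the target projection.

    On L2-normalized vectors this equals 2 - 2 * cosine similarity,
    averaged over the batch; it is bounded in [0, 4].
    """
    p, z = normalize(online_pred), normalize(target_proj)
    return float(np.mean(np.sum((p - z) ** 2, axis=-1)))

def ema_update(target_w, online_w, tau=0.99):
    """Target network weights are an exponential moving average (EMA)
    of the online network's weights; no gradients flow to the target."""
    return tau * target_w + (1 - tau) * online_w

# Toy usage: a 25-joint skeleton (as in NTU RGB+D) over 8 frames,
# encoded by stand-in linear maps instead of real networks.
seq = rng.normal(size=(8, 25, 3))
x1 = view1(seq).reshape(1, -1)   # online branch input
x2 = view2(seq).reshape(1, -1)   # target branch input
W_online = 0.1 * rng.normal(size=(x1.shape[1], 16))
W_target = 0.1 * rng.normal(size=(x1.shape[1], 16))
loss = byol_loss(x1 @ W_online, x2 @ W_target)
W_target = ema_update(W_target, W_online)   # slow target update
```

In full BYOL the loss is also symmetrized by swapping the two views between the branches, and only the online branch carries a predictor head; this sketch keeps a single direction for brevity.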

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Skeleton Based Action Recognition | NTU RGB+D | BRL | Accuracy (Cross-Subject) | 86.8 | # 1 |
| Unsupervised Skeleton Based Action Recognition | NTU RGB+D | BRL | Accuracy (Cross-View) | 91.2 | # 1 |
| Unsupervised Skeleton Based Action Recognition | NTU RGB+D 120 | BRL | Accuracy (Cross-Subject) | 77.1 | # 1 |
| Unsupervised Skeleton Based Action Recognition | NTU RGB+D 120 | BRL | Accuracy (Cross-Setup) | 79.2 | # 1 |
| Unsupervised Skeleton Based Action Recognition | PKU-MMD | BRL | Accuracy (CS) | 55.25 | # 1 |