Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

CVPR 2019. Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing...

TASK                                DATASET  MODEL                      METRIC NAME                   METRIC VALUE  GLOBAL RANK
Action Recognition In Videos        HMDB-51  Pretrained on Kinetics     Average accuracy of 3 splits  33.4          #26
Action Recognition In Videos        UCF101   Pretrained on Kinetics     3-fold Accuracy               61.2          #31
Self-Supervised Action Recognition  UCF101   Motion & Appearance (C3D)  3-fold Accuracy               58.8          #12