Collaborative Spatiotemporal Feature Learning for Video Action Recognition

CVPR 2019 · Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu

Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D)...


Evaluation Results from the Paper


Task                   Dataset          Model                                Metric          Value  Global rank
---------------------  ---------------  -----------------------------------  --------------  -----  -----------
Action Classification  Kinetics-400     CoST ResNet-101 (ImageNet pretrain)  Accuracy        77.5   # 6
Action Classification  Moments in Time  CoST (ResNet-101, 32 frames)         Top 1 Accuracy  32.4%  # 1
Action Classification  Moments in Time  CoST (ResNet-101, 32 frames)         Top 5 Accuracy  60.0%  # 1