Recent successes in human action recognition with deep learning methods
mostly adopt the supervised learning paradigm, which requires a significant
amount of manually labeled data to achieve good performance. However, label
collection is an expensive and time-consuming process...
In this work, we propose
an unsupervised learning framework, which exploits unlabeled data to learn
video representations. Unlike previous work on video representation
learning, our unsupervised learning task is to predict 3D motion in multiple
target views using the video representation from a source view. By learning to
extrapolate cross-view motions, the representation captures view-invariant
motion dynamics that are discriminative for the action. In addition, we propose
a view-adversarial training method to enhance learning of view-invariant
features. We demonstrate the effectiveness of the learned representations for
action recognition on multiple datasets.
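View-adversarial training of this kind is commonly realized with a gradient reversal step, as in domain-adversarial training: a view discriminator is trained to predict the camera view from the representation, while the encoder receives the negated discriminator gradient so the representation becomes view-uninformative. A minimal NumPy sketch under that assumption, with all names, shapes, and the linear encoder purely illustrative (not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 64 clips in 8 feature dims, from two camera views.
x = rng.normal(size=(64, 8))
view = rng.integers(0, 2, size=64)       # view label (0 or 1)
x[view == 1] += 2.0                      # view-specific bias to remove

W_enc = rng.normal(scale=0.1, size=(8, 4))   # "encoder" weights
w_d = rng.normal(scale=0.1, size=4)          # view-discriminator weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

lr, lam = 0.1, 1.0                       # lam scales the reversed gradient
for _ in range(200):
    h = x @ W_enc                        # video representation
    p = sigmoid(h @ w_d)                 # predicted probability of view 1
    err = p - view                       # d(BCE loss)/d(logit)
    # Discriminator step: descend the view-classification loss.
    w_d -= lr * (h.T @ err) / len(x)
    # Encoder step: gradient reversal -- ascend the discriminator loss
    # so the representation carries no view information.
    grad_h = np.outer(err, w_d) / len(x)
    W_enc -= lr * (-lam) * (x.T @ grad_h)

# At equilibrium the discriminator should be near chance accuracy.
acc = ((sigmoid(x @ W_enc @ w_d) > 0.5) == view).mean()
```

The alternating updates mirror the usual adversarial game: the discriminator improves its view prediction, and the reversed gradient pushes the encoder toward features the discriminator cannot exploit.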