We propose SCVRL, a novel contrastive framework for self-supervised video representation learning.
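The text does not spell out SCVRL's exact objective, but contrastive frameworks of this kind typically optimize an InfoNCE-style loss that pulls embeddings of matching clip views together and pushes non-matching ones apart. A minimal sketch follows; the function name `info_nce_loss`, the temperature value, and the toy embeddings are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE: row i of `positives` is the positive for row i of
    `anchors`; all other rows in the batch serve as negatives."""
    # L2-normalize embeddings so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal; minimize their negative log-prob.
    return -np.mean(np.diag(log_probs))

# Toy embeddings for a batch of 4 clips and their "augmented" views.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
loss_matched = info_nce_loss(z, z)                        # aligned views
loss_random = info_nce_loss(z, rng.normal(size=(4, 16)))  # unrelated views
print(loss_matched, loss_random)
```

As expected for a contrastive objective, the loss is far lower when each anchor's positive is genuinely similar to it than when positives are unrelated noise.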
Given a static image of an object and a local poke at a single pixel, the approach predicts how the object would deform over time.
Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: given an image, the model must be able to predict a future progression of the portrayed scene; conversely, a video should be explained in terms of its static image content and all remaining characteristics not present in the initial frame.
Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or even to directly transfer behavior observed in a given video sequence.
A central aspect is the unsupervised learning of posture and behavior representations, which enables an objective comparison of movement.