A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment.
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video.
Learning to predict future images from a video sequence involves constructing an internal representation that accurately models how the images evolve, and therefore, to some degree, the scene's content and dynamics.
Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world.
However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging -- the ambiguous nature of the problem can cause a naively designed model to average together possible futures into a single, blurry prediction.
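The averaging effect described above can be illustrated with a toy calculation (a minimal sketch with made-up 5-pixel "frames", not any particular model): when two sharp futures are equally likely, the single prediction that minimizes expected mean-squared error is their pixelwise mean, i.e., a blurry frame that commits to neither.

```python
# Two equally likely "futures": a bright pixel moves left or right.
# Each future is a tiny 5-pixel grayscale frame (1.0 = bright).
future_left  = [0.0, 1.0, 0.0, 0.0, 0.0]
future_right = [0.0, 0.0, 0.0, 1.0, 0.0]

def mse(a, b):
    """Mean-squared error between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def expected_mse(pred):
    """Expected error of one prediction over both equally likely futures."""
    return 0.5 * mse(pred, future_left) + 0.5 * mse(pred, future_right)

# The MSE-optimal single prediction is the pixelwise mean of the futures:
# a "blurry" frame with two half-bright pixels instead of one sharp one.
blurry = [0.5 * (l + r) for l, r in zip(future_left, future_right)]

# Committing to either sharp future costs more in expected MSE than blurring.
print(expected_mse(blurry), expected_mse(future_left))  # 0.1 0.2
```

A naively trained MSE model therefore prefers the blurry average, which is exactly the failure mode the sentence above describes; modeling uncertainty explicitly (e.g., with latent variables or adversarial losses) is one way around it.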
The ability to predict and therefore to anticipate the future is an important attribute of intelligence.
In this work we introduce a new framework for performing temporal predictions in the presence of uncertainty.
We apply the "detection head" of Mask R-CNN to the predicted features to produce instance segmentations of future frames.
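The idea of predicting future detector features (rather than future pixels) might be sketched as follows. This is an illustrative stand-in, not the paper's architecture: `FeaturePredictor` and its layer sizes are hypothetical, and in practice the predicted feature map would be handed to a pretrained, frozen detection head in place of the backbone's output.

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Hypothetical module: maps detector features of past frames to the
    detector features of a future frame (names and sizes are illustrative)."""
    def __init__(self, channels=256, context=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * context, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, past_feats):
        # past_feats: list of (B, C, H, W) feature maps from past frames.
        x = torch.cat(past_feats, dim=1)  # stack context frames on channels
        return self.net(x)                # predicted future feature map

predictor = FeaturePredictor()
past = [torch.randn(1, 256, 32, 32) for _ in range(4)]
future_feat = predictor(past)  # (1, 256, 32, 32)
# future_feat would then be fed to the frozen Mask R-CNN detection head
# instead of backbone features, yielding instance masks for the future frame.
print(tuple(future_feat.shape))  # (1, 256, 32, 32)
```

The appeal of this design is that the detection head never needs retraining: prediction happens entirely in feature space, where ambiguity is easier to tolerate than in pixel space.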
Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification.