A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment.
We study the problem of video-to-video synthesis, whose goal is to learn a mapping from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video.
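Read operationally, such a mapping is a conditional generator applied frame by frame. The sketch below only illustrates that interface under an assumed autoregressive factorization; `generate_frame` and its blending rule are placeholders, not the architecture described in the source.

```python
import numpy as np

# Minimal sketch of a video-to-video interface, assuming each output frame is
# conditioned on the current source frame (e.g., a segmentation mask) and on
# previously generated frames. The generator is a placeholder, not a real network.
def generate_frame(source_frame, past_frames):
    """Placeholder generator G(s_t, x_<t) -> x_t."""
    context = past_frames[-1] if past_frames else np.zeros_like(source_frame, dtype=float)
    return 0.9 * source_frame.astype(float) + 0.1 * context

def video_to_video(source_video):
    """Map a source video (T, H, W) of masks to an output video of frames."""
    outputs = []
    for source_frame in source_video:
        outputs.append(generate_frame(source_frame, outputs))
    return np.stack(outputs)

masks = np.random.randint(0, 2, size=(5, 4, 4))   # toy segmentation sequence
frames = video_to_video(masks)
print(frames.shape)                                # (5, 4, 4)
```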
Learning to predict future images from a video sequence requires constructing an internal representation that accurately models how the images evolve and, therefore, to some degree, their content and dynamics.
Here, we explore prediction of future frames in a video sequence as an unsupervised rule for learning about the structure of the visual world.
In this paper, we present a video prediction-based methodology that scales up training sets by synthesizing new training samples, with the goal of improving the accuracy of semantic segmentation networks.
This approach is the state of the art for semantic segmentation on CamVid when extra training data is used.
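One way to picture this kind of training-set scaling is to propagate a labeled frame forward in time so that the synthesized image and its label stay aligned. The following toy sketch assumes a hypothetical `predict_motion` model and a simple integer-shift `warp`; both are illustrative stand-ins, not the method's actual prediction or sampling procedure.

```python
import numpy as np

# Toy sketch of scaling up a segmentation training set with video prediction:
# a motion field estimated between frame t and t+1 is used to warp both the
# image and its label map, yielding a new synthetic (image, label) pair.
# The constant one-pixel shift below stands in for a learned prediction model.
def predict_motion(frame_t, frame_t1):
    """Hypothetical stand-in for a learned motion/flow predictor."""
    return (0, 1)  # pretend everything moves one pixel to the right

def warp(array, motion):
    dy, dx = motion
    return np.roll(np.roll(array, dy, axis=0), dx, axis=1)

def synthesize_pair(image_t, label_t, image_t1):
    motion = predict_motion(image_t, image_t1)
    return warp(image_t, motion), warp(label_t, motion)

image_t = np.random.rand(4, 4)
label_t = np.random.randint(0, 3, size=(4, 4))
image_t1 = np.roll(image_t, 1, axis=1)                        # the "real" next frame

new_image, new_label = synthesize_pair(image_t, label_t, image_t1)
training_set = [(image_t, label_t), (new_image, new_label)]   # augmented set
print(len(training_set))
```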
However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging: because the future is inherently ambiguous, a naively designed model tends to average the possible futures into a single, blurry prediction.
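The averaging effect follows directly from the loss: a deterministic predictor trained with mean squared error is pulled toward the per-pixel mean of the plausible futures. The toy example below (two equally likely futures, plain NumPy) is a minimal illustration of that pull, not an experiment from any of the papers above.

```python
import numpy as np

# Two equally likely futures for the same past: an object ends up either on the
# left or on the right. A deterministic model trained with mean squared error
# is driven toward the per-pixel mean of the futures, a "blurry" frame that
# matches neither outcome yet scores a lower expected loss than either one.
H, W = 8, 8
future_left = np.zeros((H, W));  future_left[:, 1] = 1.0    # object ends up left
future_right = np.zeros((H, W)); future_right[:, 6] = 1.0   # object ends up right
futures = np.stack([future_left, future_right])             # both equally likely

mse = lambda pred: np.mean((futures - pred) ** 2)           # expected squared error
mean_prediction = futures.mean(axis=0)                      # half-intensity ghost in both places

print("MSE of committing to 'left' :", mse(future_left))    # penalized when 'right' occurs
print("MSE of committing to 'right':", mse(future_right))   # penalized when 'left' occurs
print("MSE of the blurry average   :", mse(mean_prediction))# lowest, despite being unrealistic
```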