VarNet: Exploring Variations for Unsupervised Video Prediction

Unsupervised video prediction is a very challenging task due to the complexity and diversity of natural scenes. Prior works that directly predict pixels or optical flows either suffer from blurring or require additional assumptions. We highlight that the crux of video frame prediction lies in precisely capturing the inter-frame variations, which encompass the movement of objects and the evolution of the surrounding environment. We then present an unsupervised video prediction framework, Variation Network (VarNet), which directly predicts the variations between adjacent frames and fuses them with the current frame to generate the future frame. In addition, we propose an adaptive re-weighting mechanism for the loss function that assigns each pixel a fair weight according to the amplitude of its variation. Extensive experiments on both short-term and long-term video prediction are conducted on two challenging datasets, KTH and KITTI, with two evaluation metrics, PSNR and SSIM. On the KTH dataset, VarNet outperforms state-of-the-art works by up to 11.9% on PSNR and 9.5% on SSIM. On the KITTI dataset, the performance gains are up to 55.1% on PSNR and 15.9% on SSIM. Moreover, we verify that our model generalizes better than other state-of-the-art methods by testing on the unseen CalTech Pedestrian dataset after training on the KITTI dataset. Source code and video are available at https://github.com/jinbeibei/VarNet.
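The abstract only sketches the mechanism, so the following is a minimal, illustrative PyTorch-style sketch of the two ideas it describes: predicting the inter-frame variation and fusing it with the current frame, and re-weighting the per-pixel loss by the amplitude of the variation. The `VariationPredictor` network, its layer sizes, and the normalization inside `reweighted_loss` are placeholder assumptions of this sketch, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of the VarNet idea described in the abstract,
# NOT the authors' code: predict the variation between frames, fuse it
# with the current frame, and weight the loss by variation amplitude.
import torch
import torch.nn as nn


class VariationPredictor(nn.Module):
    """Placeholder network that predicts the next inter-frame variation."""

    def __init__(self, channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, prev_frame: torch.Tensor, curr_frame: torch.Tensor) -> torch.Tensor:
        # Predict the variation between the current and the next frame
        # from the two most recent observed frames.
        return self.net(torch.cat([prev_frame, curr_frame], dim=1))


def predict_next_frame(model, prev_frame, curr_frame):
    # Fuse the predicted variation with the current frame to form the
    # future frame.
    variation = model(prev_frame, curr_frame)
    return curr_frame + variation, variation


def reweighted_loss(pred_frame, true_frame, true_variation, eps=1e-6):
    # Give each pixel a weight proportional to the amplitude of its
    # variation, so moving regions are not drowned out by the static
    # background. The normalization here is an assumption of this sketch.
    weight = true_variation.abs()
    weight = weight / (weight.mean() + eps)
    return (weight * (pred_frame - true_frame) ** 2).mean()
```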


Datasets

KTH, KITTI, CalTech Pedestrian
Results from the Paper


Ranked #1 on Video Prediction on KTH (Cond metric)
Task              Dataset  Model   Metric  Value   Global Rank
Video Prediction  KTH      VarNet  PSNR    28.48   #5
Video Prediction  KTH      VarNet  SSIM    0.843   #11
Video Prediction  KTH      VarNet  Cond    10      #1
Video Prediction  KTH      VarNet  Pred    20      #1

Methods

No methods listed for this paper.