Improved Conditional VRNNs for Video Prediction

Predicting future frames of a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they tend to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher-capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions that better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
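
As a rough illustration of the core idea (a hierarchy of latent variables whose prior and posterior are both learned and conditioned level by level), the following PyTorch sketch implements one step of a two-level hierarchical VRNN cell. Everything here is an assumption made for exposition: the class name `HierVRNNCell`, the linear layers, and the dimensions are placeholders, whereas the paper's actual model operates on image frames with convolutional networks and a deeper hierarchy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierVRNNCell(nn.Module):
    """Illustrative one-step cell of a VRNN with a two-level latent hierarchy.
    Layer choices and sizes are placeholders, not the paper's architecture."""

    def __init__(self, x_dim=128, h_dim=256, z1_dim=32, z2_dim=32):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim + z1_dim + z2_dim, h_dim)
        # Hierarchical prior: p(z1 | h) and p(z2 | z1, h); each head outputs
        # the mean and log-variance of a diagonal Gaussian.
        self.prior_z1 = nn.Linear(h_dim, 2 * z1_dim)
        self.prior_z2 = nn.Linear(h_dim + z1_dim, 2 * z2_dim)
        # Hierarchical posterior: q(z1 | x, h) and q(z2 | z1, x, h).
        self.post_z1 = nn.Linear(x_dim + h_dim, 2 * z1_dim)
        self.post_z2 = nn.Linear(x_dim + h_dim + z1_dim, 2 * z2_dim)
        # Likelihood model p(x | z1, z2, h).
        self.decoder = nn.Linear(h_dim + z1_dim + z2_dim, x_dim)

    @staticmethod
    def sample(stats):
        # Reparameterized sample from a diagonal Gaussian given (mu, logvar).
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    @staticmethod
    def kl(mu_q, lv_q, mu_p, lv_p):
        # KL divergence between two diagonal Gaussians, summed over latent dims.
        return 0.5 * (lv_p - lv_q
                      + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp()
                      - 1).sum(-1)

    def forward(self, x, h):
        # Posterior: sample z1 first, then condition z2 on z1 (the hierarchy).
        z1, mu_q1, lv_q1 = self.sample(self.post_z1(torch.cat([x, h], -1)))
        z2, mu_q2, lv_q2 = self.sample(self.post_z2(torch.cat([x, h, z1], -1)))
        # Matching hierarchical prior, used for the KL term.
        mu_p1, lv_p1 = self.prior_z1(h).chunk(2, dim=-1)
        mu_p2, lv_p2 = self.prior_z2(torch.cat([h, z1], -1)).chunk(2, dim=-1)
        kl = (self.kl(mu_q1, lv_q1, mu_p1, lv_p1)
              + self.kl(mu_q2, lv_q2, mu_p2, lv_p2))
        x_hat = self.decoder(torch.cat([h, z1, z2], -1))   # reconstruction
        h_next = self.rnn(torch.cat([x, z1, z2], -1), h)   # recurrence update
        return x_hat, h_next, kl

# Toy usage: one step on random features standing in for video frames.
cell = HierVRNNCell()
h = torch.zeros(4, 256)
x = torch.randn(4, 128)
x_hat, h, kl = cell(x, h)
loss = F.mse_loss(x_hat, x) + kl.mean()  # ELBO-style surrogate objective
```

Making the second-level distributions depend on the sampled z1 is what buys extra flexibility over a single Gaussian latent: the marginal over (z1, z2) is a continuous mixture, so both the prior and the posterior can represent richer, multi-modal beliefs about future frames.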

ICCV 2019
| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Generation | BAIR Robot Pushing | VRNN 1L | FVD | 149.22 | #18 |
| | | | SSIM | 0.829±0.06 | #3 |
| | | | LPIPS | 0.058±0.03 | #8 |
| | | | Cond | 2 | #13 |
| | | | Pred | 28 | #20 |
| | | | Train | 10 | #23 |
| Video Generation | BAIR Robot Pushing | Hier-VRNN | FVD | 143.4 | #16 |
| | | | SSIM | 0.822±0.06 | #4 |
| | | | LPIPS | 0.055±0.03 | #10 |
| | | | Cond | 2 | #13 |
| | | | Pred | 28 | #20 |
| | | | Train | 10 | #23 |
| Video Prediction | Cityscapes 128x128 | Hier-VRNN | FVD | 567.51 | #2 |
| | | | SSIM | 0.628±0.1 | #3 |
| | | | LPIPS | 0.264±0.07 | #2 |
| | | | Cond | 2 | #1 |
| | | | Pred | 28 | #3 |
| | | | Train | 10 | #1 |

Cond, Pred, and Train denote the number of conditioning, predicted, and training frames used in the evaluation protocol.
