TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Prediction	Cityscapes 128x128	GHVAEs	FVD	418.00 ± 5.0	# 1
Video Prediction	Cityscapes 128x128	GHVAEs	SSIM	0.740 ± 0.4	# 1
Video Prediction	Cityscapes 128x128	GHVAEs	LPIPS	0.193 ± 0.014	# 1
Video Prediction	Cityscapes 128x128	GHVAEs	Cond.	2	# 1
Video Prediction	Cityscapes 128x128	GHVAEs	Pred	28	# 3
Video Prediction	Cityscapes 128x128	GHVAEs	Train	10	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/greedy-hierarchical-variational-autoencoders/video-prediction-on-cityscapes-128x128)](https://paperswithcode.com/sota/video-prediction-on-cityscapes-128x128?p=greedy-hierarchical-variational-autoencoders)`

Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

CVPR 2021 · Bohan Wu, Suraj Nair, Roberto Martin-Martin, Li Fei-Fei, Chelsea Finn

A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
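As a rough illustration of the greedy, module-by-module training idea described in the abstract, the toy sketch below fits a stack of autoencoder modules one level at a time. It is not the authors' implementation: GHVAEs are variational video prediction models, whereas here each module is a linear autoencoder fitted in closed form with an SVD (a PCA-style stand-in chosen for brevity), and each module reconstructs the previous level's codes. The names `fit_module`, `encode`, `decode`, `train_greedy`, and `reconstruct` are hypothetical. The point it tries to show is only the training schedule: each new module is fitted on the outputs of the already-fitted, frozen modules below it, so only one module is optimized at a time.

```python
import numpy as np


def fit_module(inputs, latent_dim):
    """Fit one linear autoencoder module in closed form (PCA via SVD).

    Toy stand-in for training a single level; returns (mean, components).
    """
    mean = inputs.mean(axis=0)
    _, _, vt = np.linalg.svd(inputs - mean, full_matrices=False)
    return mean, vt[:latent_dim]


def encode(module, x):
    mean, components = module
    return (x - mean) @ components.T


def decode(module, z):
    mean, components = module
    return z @ components + mean


def train_greedy(data, latent_dims):
    """Fit modules bottom-up; earlier modules stay frozen once fitted."""
    modules = []
    codes = data
    for dim in latent_dims:
        module = fit_module(codes, dim)  # only the current module is fitted
        modules.append(module)
        codes = encode(module, codes)    # frozen output feeds the next level
    return modules


def reconstruct(modules, x):
    """Encode through every level, then decode back in reverse order."""
    z = x
    for module in modules:
        z = encode(module, z)
    for module in reversed(modules):
        z = decode(module, z)
    return z


# Usage on synthetic rank-2 data embedded in 8 dimensions (assumed toy data).
rng = np.random.default_rng(0)
data = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 8))
modules = train_greedy(data, latent_dims=[4, 2])
error = np.abs(reconstruct(modules, data) - data).max()
```

In this toy setting the data has only two underlying degrees of freedom, so a 4-dimensional level followed by a 2-dimensional level can reconstruct it almost exactly. Real video data offers no such guarantee, and the paper's modules are nonlinear and stochastic.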

Code

No code implementations yet.

Tasks

Video Prediction

Datasets

Cityscapes

Human3.6M

Results from the Paper

Ranked #1 on Video Prediction on Cityscapes 128x128

Methods

No methods listed for this paper.