Scaling Autoregressive Video Models

Due to the statistical complexity of video, its high degree of inherent stochasticity, and the sheer amount of data involved, generating natural video remains a challenging task. State-of-the-art video generation models often attempt to address these issues by combining complex, usually video-specific neural network architectures, latent-variable models, adversarial training, and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high-quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism. We also present results from training our models on Kinetics, a large-scale action recognition dataset comprising YouTube videos that exhibit phenomena such as camera movement, complex object interactions, and diverse human movement. While consistently modeling these phenomena remains elusive, we hope that our results, which include occasional realistic continuations, encourage further research on comparatively complex, large-scale datasets such as Kinetics.
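As a rough illustration of the three-dimensional self-attention the abstract refers to, the sketch below flattens a video's time, height, and width axes into one token sequence and applies causally masked self-attention over it, so each position can only attend to positions that precede it in generation order. This is a minimal sketch under assumed shapes and hyperparameters, not the paper's implementation; the module name `VideoSelfAttention`, the dimensions, and the dense attention over all T·H·W positions are illustrative (dense attention at this granularity would be far too expensive at real video resolutions).

```python
# Minimal sketch of causally masked self-attention over a video's three
# spatiotemporal axes. Not the paper's implementation; shapes and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class VideoSelfAttention(nn.Module):
    """Joint space-time self-attention over a video of shape (B, T, H, W, C)."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, h, w, c = video.shape
        # Flatten time, height, and width into one sequence of T*H*W tokens.
        tokens = video.reshape(b, t * h * w, c)
        n = tokens.shape[1]
        # Causal mask: True entries are masked out, so each token attends
        # only to itself and earlier tokens in raster-scan order, which is
        # what makes the model autoregressive.
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=causal_mask)
        return out.reshape(b, t, h, w, c)

# Example: 2 videos, 4 frames of 8x8 positions with 64 channels each.
x = torch.randn(2, 4, 8, 8, 64)
y = VideoSelfAttention()(x)
print(y.shape)  # torch.Size([2, 4, 8, 8, 64])
```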

ICLR 2020
| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Generation | BAIR Robot Pushing | Video Transformer | FVD score | 94 ± 2 | #7 |
| Video Generation | BAIR Robot Pushing | Video Transformer | Cond | 1 | #1 |
| Video Generation | BAIR Robot Pushing | Video Transformer | Pred | 15 | #8 |
| Video Generation | BAIR Robot Pushing | Video Transformer | Train | 15 | #2 |
| Video Prediction | Kinetics-600 (12 frames, 64×64) | Video Transformer | FVD | 170 ± 5 | #12 |
| Video Prediction | Kinetics-600 (12 frames, 64×64) | Video Transformer | Cond | 5 | #2 |
| Video Prediction | Kinetics-600 (12 frames, 64×64) | Video Transformer | Pred | 11 | #2 |

Note (BAIR Robot Pushing): FVD on only the leftmost samples is 94; FVD on unrolled samples (all subsequences) is 96.
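For context on the FVD (Fréchet Video Distance) numbers above: FVD compares generated and real videos via the Fréchet distance between Gaussian fits to features from a pretrained I3D action-recognition network. The sketch below shows only the distance computation itself; the I3D feature extraction is assumed to happen elsewhere, and the random arrays stand in for real feature statistics purely for illustration.

```python
# Minimal sketch of the Frechet distance underlying the FVD metric.
# Feature extraction with a pretrained I3D network is assumed done
# elsewhere; random data stands in for features here.
import numpy as np
from scipy import linalg

def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Illustrative usage with random 400-dim "features" for real/generated videos.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 400))
fake = rng.normal(size=(1000, 400))
fvd = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False))
print(fvd)
```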
