MAGVIT: Masked Generative Video Transformer

We introduce the MAsked Generative VIdeo Transformer (MAGVIT) to tackle various video synthesis tasks with a single model. We propose a 3D tokenizer that quantizes a video into spatial-temporal visual tokens, along with an embedding method for masked video token modeling that facilitates multi-task learning. Extensive experiments demonstrate the quality, efficiency, and flexibility of MAGVIT: (i) it performs favorably against state-of-the-art approaches and establishes the best published FVD on three video generation benchmarks, including the challenging Kinetics-600; (ii) it outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models; (iii) a single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
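The masked video token modeling described above generates all tokens in parallel over a small number of refinement steps, progressively shrinking the masked set, rather than decoding tokens one by one. The toy sketch below illustrates that decoding loop under assumptions: the cosine masking schedule follows the MaskGIT family of models, the step count is illustrative, and the per-token "confidence" scores are random placeholders standing in for model logits.

```python
import math
import random

def mask_ratio(step: int, total_steps: int) -> float:
    """Cosine masking schedule: fraction of tokens left masked after `step`.
    Starts near 1.0 (fully masked) and decays to 0 at the final step."""
    return math.cos(math.pi / 2 * step / total_steps)

def iterative_decode(num_tokens: int, total_steps: int, seed: int = 0):
    """Toy non-autoregressive decoding loop over a flat token grid.
    Each step, every masked token is 'predicted'; the least confident
    predictions are re-masked so the schedule is respected.
    Confidence is random here -- a placeholder for real model logits."""
    rng = random.Random(seed)
    masked = set(range(num_tokens))
    history = []
    for step in range(1, total_steps + 1):
        keep_masked = int(mask_ratio(step, total_steps) * num_tokens)
        # Score all currently masked tokens, re-mask the lowest-scoring ones.
        confidence = {i: rng.random() for i in masked}
        masked = set(sorted(masked, key=lambda i: confidence[i])[:keep_masked])
        history.append(len(masked))
    return history

# Number of still-masked tokens shrinks monotonically to zero.
hist = iterative_decode(num_tokens=1024, total_steps=12)
```

Because every step predicts many tokens at once, the whole video is produced in a dozen forward passes instead of one pass per token, which is the source of the inference-time advantage over autoregressive decoding claimed in the abstract.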

CVPR 2023
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Prediction | BAIR Robot Pushing | MAGVIT (-B-FP) | FVD | 76±0.1 | # 2 |
| Video Generation | BAIR Robot Pushing | MAGVIT | FVD score | 62 | # 1 |
| | | | Cond | 1 | # 1 |
| | | | Pred | 15 | # 8 |
| | | | Train | 15 | # 2 |
| Video Prediction | BAIR Robot Pushing | MAGVIT (-L-FP) | FVD | 62±0.1 | # 1 |
| Video Prediction | Kinetics-600 (12 frames, 64x64) | MAGVIT (-B-FP) | FVD | 24.5±0.9 | # 7 |
| | | | Cond | 5 | # 2 |
| | | | Pred | 11 | # 2 |
| Video Prediction | Kinetics-600 (12 frames, 64x64) | MAGVIT (-L-FP) | FVD | 9.9±0.3 | # 3 |
| | | | Cond | 5 | # 2 |
| | | | Pred | 11 | # 2 |
| Video Generation | Kinetics-600 (12 frames, 64x64) | MAGVIT | FVD | 9.9 | # 3 |
| Video Prediction | Something-Something V2 | MAGVIT | FVD | 28.5 | # 1 |
| Text-to-Video Generation | Something-Something V2 | MAGVIT | FVD | 79.1 | # 1 |
| Video Generation | UCF-101 | MAGVIT (-L-CG, 128x128, class-conditional) | Inception Score | 89.27±0.15 | # 1 |
| | | | FVD16 | 76±2 | # 3 |
| Video Generation | UCF-101 | MAGVIT (-B-CG, 128x128, class-conditional) | Inception Score | 83.55±0.14 | # 2 |
| | | | FVD16 | 159±2 | # 6 |
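Most of the ranks above are based on FVD, the Fréchet distance between the statistics of features extracted from real and generated videos (typically with a pretrained I3D network). A minimal sketch of the distance itself, assuming the feature means and covariances have already been computed; this is the generic Fréchet/FID formula, not the paper's exact evaluation pipeline:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1), N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # Numerical noise can introduce tiny imaginary parts; drop them.
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean))
```

Identical feature distributions give a distance of 0, so lower FVD means the generated videos' feature statistics are closer to the real ones.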

Methods