MAGVIT: Masked Generative Video Transformer

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
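The abstract describes tokenizing a video into a 3D grid of discrete tokens and training a transformer with masked video token modeling. Below is a minimal sketch of what one such training step could look like; the names (`tokenizer`, `transformer`), shapes, and cosine masking schedule are illustrative assumptions, not the released MAGVIT implementation.

```python
# Hypothetical sketch of masked video token modeling on a 3D token grid.
import torch
import torch.nn.functional as F

def masked_token_training_step(video, tokenizer, transformer, mask_token_id, vocab_size):
    """One step: quantize a video into discrete spatio-temporal tokens,
    mask a random subset, and predict the masked tokens."""
    with torch.no_grad():
        tokens = tokenizer.encode(video)      # (B, T', H', W') integer token ids (assumed API)
    tokens = tokens.flatten(1)                # (B, N) flattened spatio-temporal sequence

    # Sample a per-example masking ratio (a cosine schedule is common in masked modeling).
    ratio = torch.cos(0.5 * torch.pi * torch.rand(tokens.size(0), 1, device=tokens.device))
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio

    inputs = tokens.clone()
    inputs[mask] = mask_token_id              # replace masked positions with a [MASK] token

    logits = transformer(inputs)              # (B, N, vocab_size), bidirectional attention
    loss = F.cross_entropy(                   # supervise only the masked positions
        logits[mask].view(-1, vocab_size),
        tokens[mask].view(-1),
    )
    return loss
```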

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Prediction | BAIR Robot Pushing | MAGVIT (-B-FP) | FVD | 76±0.1 | #2 |
| Video Generation | BAIR Robot Pushing | MAGVIT | FVD score | 62 | #1 |
| | | | Cond | 1 | #1 |
| | | | Pred | 15 | #8 |
| | | | Train | 15 | #2 |
| Video Prediction | BAIR Robot Pushing | MAGVIT (-L-FP) | FVD | 62±0.1 | #1 |
| Video Prediction | Kinetics-600 (12 frames, 64x64) | MAGVIT (-B-FP) | FVD | 24.5±0.9 | #7 |
| | | | Cond | 5 | #2 |
| | | | Pred | 11 | #2 |
| Video Prediction | Kinetics-600 (12 frames, 64x64) | MAGVIT (-L-FP) | FVD | 9.9±0.3 | #3 |
| | | | Cond | 5 | #2 |
| | | | Pred | 11 | #2 |
| Video Generation | Kinetics-600 (12 frames, 64x64) | MAGVIT | FVD | 9.9 | #3 |
| Video Prediction | Something-Something V2 | MAGVIT | FVD | 28.5 | #1 |
| Text-to-Video Generation | Something-Something V2 | MAGVIT | FVD | 79.1 | #1 |
| Video Generation | UCF-101 | MAGVIT (-L-CG, 128x128, class-conditional) | Inception Score | 89.27±0.15 | #1 |
| | | | FVD16 | 76±2 | #3 |
| Video Generation | UCF-101 | MAGVIT (-B-CG, 128x128, class-conditional) | Inception Score | 83.55±0.14 | #2 |
| | | | FVD16 | 159±2 | #6 |
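The primary metric in the table above is FVD (Fréchet Video Distance), which compares Gaussians fit to features of real and generated videos. As a rough illustration, the sketch below computes a Fréchet distance between two feature sets; it assumes the video features have already been extracted with a pretrained video network (e.g., I3D) and is not the exact evaluation pipeline used in the paper.

```python
# Minimal sketch of an FVD-style Fréchet distance between feature statistics.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """real_feats, gen_feats: arrays of shape (num_videos, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```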
