MaskGIT: Masked Generative Image Transformer

Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.

CVPR 2022
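
The inference procedure described in the abstract (generate all tokens at once, then refine iteratively) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: `toy_predict` is a hypothetical random stand-in for the bidirectional transformer, `MASK` is a hypothetical mask-token id, and the cosine schedule for how many tokens stay masked at each step is one possible choice. The sketch keeps the most confident predictions at each step and re-masks the rest, so the number of model calls equals `steps` rather than the number of tokens.

```python
import math
import random

MASK = -1  # hypothetical id for the special mask token


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for the bidirectional transformer.

    Returns one (token_id, confidence) guess per position. A real model
    would condition on the unmasked entries of `tokens`; this one is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens, vocab_size, steps, seed=0):
    """Parallel decoding sketch: start fully masked, fill in over `steps` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens

    for t in range(1, steps + 1):
        # One forward pass predicts every position simultaneously.
        preds = toy_predict(tokens, vocab_size, rng)

        # Assumed cosine schedule: how many tokens remain masked after step t.
        if t == steps:
            keep_masked = 0
        else:
            keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * t / steps))

        # Among still-masked positions, commit the most confident predictions
        # and leave the least confident ones masked for the next pass.
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = max(len(masked) - keep_masked, 0)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]

    return tokens


if __name__ == "__main__":
    result = iterative_decode(num_tokens=16, vocab_size=10, steps=4)
    print(result)
```

With a real model, tokens committed in earlier passes would serve as context for the remaining masked positions; the random stand-in only demonstrates the control flow.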

Results from the Paper

Task                      Dataset           Model             Metric Name               Metric Value  Global Rank
------------------------  ----------------  ----------------  ------------------------  ------------  -----------
Image Generation          ImageNet 256x256  MaskGIT (a=0.05)  FID                       4.02          # 47
Image Generation          ImageNet 256x256  MaskGIT           FID                       6.18          # 56
Image Generation          ImageNet 512x512  MaskGIT (a=0.05)  FID                       4.46          # 31
Image Generation          ImageNet 512x512  MaskGIT (a=0.05)  Inception score           342.0         # 2
Image Generation          ImageNet 512x512  MaskGIT           FID                       7.32          # 33
Image Generation          ImageNet 512x512  MaskGIT           Inception score           156.0         # 12
Text-to-Image Generation  LHQC              MaskGIT           Block-FID                 24.33         # 2
Image Outpainting         LHQC              MaskGIT           Block-FID (Right Extend)  14.68         # 3
Image Outpainting         LHQC              MaskGIT           Block-FID (Left Extend)   14.81         # 3
Image Outpainting         LHQC              MaskGIT           Block-FID (Down Extend)   25.57         # 3
Image Outpainting         LHQC              MaskGIT           Block-FID (Up Extend)     25.38         # 3
