CogView: Mastering Text-to-Image Generation via Transformers

Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent similar work DALL-E.
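
The abstract pairs a VQ-VAE tokenizer with a Transformer: images become sequences of discrete codebook indices that can be modeled alongside text tokens. The following is a minimal sketch of that idea only, not the CogView implementation. The function names, the toy two-dimensional codebook, and the id-offset scheme used to keep text and image ids apart are all illustrative assumptions, and a real tokenizer would quantize learned encoder features rather than hand-written vectors.

```python
from typing import List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def tokenize_image(patch_vectors: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Vector-quantize each patch feature into a discrete image token."""
    return [nearest_code(v, codebook) for v in patch_vectors]


def build_sequence(text_tokens: Sequence[int], image_tokens: Sequence[int],
                   text_vocab_size: int) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Image ids are shifted by text_vocab_size so the two vocabularies do not
    collide (an assumed convention for this sketch).
    """
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]


# Toy data: three codebook entries and three "patch" feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
patches = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]

image_tokens = tokenize_image(patches, codebook)
sequence = build_sequence([5, 7], image_tokens, text_vocab_size=10)
```

An autoregressive Transformer would then be trained to predict each token of such a sequence from the tokens before it, so that sampling image tokens after a text prompt yields an image once the tokens are decoded.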

NeurIPS 2021

Results from the Paper

Ranked #22 on Text-to-Image Generation on COCO (using extra training data)

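Task                      Dataset  Model                Metric Name      Metric Value  Global Rank
Text-to-Image Generation  COCO     CogView (zero-shot)  FID              27.1          #22
Text-to-Image Generation  COCO     CogView (zero-shot)  Inception score  18.2          #14
Text-to-Image Generation  COCO     CogView (zero-shot)  FID-1            19.4          #1
Text-to-Image Generation  COCO     CogView (zero-shot)  FID-8            23.6          #3
Text-to-Image Generation  COCO     CogView (zero-shot)  FID-2            13.9          #1
Text-to-Image Generation  COCO     CogView (zero-shot)  FID-4            19.4          #2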
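
For readers unfamiliar with the metric, the standard definition of the Fréchet Inception Distance is reproduced here as a reference. It compares Gaussian fits to Inception features of real and generated images, and lower is better. The symbols are the usual ones and are not taken from this page: $\mu_r, \Sigma_r$ are the feature mean and covariance of the real images, and $\mu_g, \Sigma_g$ those of the generated images. The FID-k rows are read here as FID computed after blurring the images with a Gaussian filter of radius k, in line with the "blurred MS COCO" wording of the abstract; that reading is an assumption, since the page itself does not define FID-k.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```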