Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
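The core idea of modeling text and image tokens as a single autoregressive stream can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary sizes, token ids, and helper names here are assumptions chosen for clarity.

```python
# Sketch: text and image tokens combined into one stream for
# autoregressive next-token prediction. Vocabulary sizes are
# illustrative assumptions, not the paper's exact values.

TEXT_VOCAB = 16384   # assumed text (BPE) vocabulary size
IMAGE_VOCAB = 8192   # assumed discrete image codebook size

def build_stream(text_tokens, image_tokens):
    """Concatenate text and image tokens into a single sequence.

    Image token ids are offset by TEXT_VOCAB so both modalities
    live in one shared vocabulary, letting a single transformer
    model the whole stream.
    """
    return list(text_tokens) + [t + TEXT_VOCAB for t in image_tokens]

def next_token_pairs(stream):
    """Autoregressive training pairs: predict token i from prefix 0..i-1."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]

stream = build_stream([5, 99, 3], [0, 8191])
# stream -> [5, 99, 3, 16384, 24575]
pairs = next_token_pairs(stream)
```

At sampling time the text prefix is fixed and the model generates the image-token suffix one position at a time, which is what makes zero-shot captioned generation possible without a task-specific decoder.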

Task: Zero-Shot Text-to-Image Generation
Dataset: COCO
Model: DALL-E

Metric   Value   Global Rank
FID-0    27.5    #3
FID-1    28.0    #3
FID-2    45.5    #3
FID-4    83.5    #3
FID-8    85.0    #3
IS       17.9    #3

