Text-to-Image Generation

203 papers with code • 10 benchmarks • 17 datasets

Text-to-Image Generation is a task at the intersection of computer vision and natural language processing: given a textual description, the goal is to generate an image that matches it. Typically the text is first encoded into a representation, such as a feature vector, and that representation then conditions the model that generates the image.
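
The two stages described above (encode the text, then generate an image from the encoding) can be sketched in a few lines. This is only a toy illustration, not any listed paper's method: `encode_text`, `generate_image`, `text_to_image`, and `EMBED_DIM` are hypothetical names, the hashed bag-of-words encoder stands in for a learned text encoder, and the fixed trigonometric mixing stands in for a trained generator.

```python
import hashlib
import math

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration


def encode_text(prompt, dim=EMBED_DIM):
    """Toy text encoder: hashed bag-of-words, L2-normalised.

    A real system would use a learned encoder; this stand-in only shows
    that the prompt becomes a fixed-length feature vector.
    """
    vec = [0.0] * dim
    for token in prompt.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def generate_image(cond, height=4, width=4):
    """Toy generator: maps the conditioning vector to pixel intensities in [0, 1]."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            # Each pixel is a fixed, position-dependent mix of the text features.
            s = sum(c * math.sin((i + 1) * (x + 1) + (y + 1))
                    for i, c in enumerate(cond))
            row.append(0.5 + 0.5 * math.tanh(s))
        image.append(row)
    return image


def text_to_image(prompt):
    """Full pipeline: text -> feature vector -> image grid."""
    return generate_image(encode_text(prompt))


if __name__ == "__main__":
    for row in text_to_image("a red bird on a branch"):
        print(" ".join(f"{p:.2f}" for p in row))
```

The point of the sketch is the interface: the generator never sees the raw text, only the vector produced by the encoder, which is why the choice of text representation matters so much in practice.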


Most implemented papers

Show and Tell: A Neural Image Caption Generator

karpathy/neuraltalk CVPR 2015

Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.

Generative Adversarial Text to Image Synthesis

reedscot/icml2016 17 May 2016

Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal.

High-Resolution Image Synthesis with Latent Diffusion Models

CompVis/stable-diffusion CVPR 2022

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.
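
The idea of forming an image through a sequence of denoising steps can be illustrated with a toy loop. This is a sketch under strong assumptions, not the paper's latent diffusion model: `toy_denoiser` and `sample` are hypothetical names, and the known `target` stands in for what a trained denoising network would have learned from data.

```python
import random


def toy_denoiser(x, target, strength):
    """Hypothetical denoiser: nudges each value toward a clean target.

    A trained denoising autoencoder would estimate this correction from
    the noisy input alone; here the target is simply given.
    """
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]


def sample(target, steps=10, seed=0):
    """Start from Gaussian noise and apply the denoiser `steps` times."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    for step in range(steps, 0, -1):
        # Each step removes a 1/step share of the remaining noise,
        # so the final step (step == 1) removes all that is left.
        x = toy_denoiser(x, target, strength=1.0 / step)
    return x


if __name__ == "__main__":
    print(sample([0.2, 0.8, 0.5]))
```

Each pass only has to solve a small denoising problem, which is the sense in which generation is decomposed into a sequence of easier steps.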

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

hanzhanggit/StackGAN ICCV 2017

Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications.

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

taoxugit/AttnGAN CVPR 2018

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.

StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

hanzhanggit/StackGAN 19 Oct 2017

In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) aiming at generating high-resolution photo-realistic images.

Taming Transformers for High-Resolution Image Synthesis

CompVis/taming-transformers CVPR 2021

We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.

Zero-Shot Text-to-Image Generation

openai/DALL-E 24 Feb 2021

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset.

Hierarchical Text-Conditional Image Generation with CLIP Latents

lucidrains/DALLE2-pytorch 13 Apr 2022

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.

MaskGIT: Masked Generative Image Transformer

google-research/maskgit CVPR 2022

At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation.
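
The inference procedure described for MaskGIT (predict all tokens at once, then refine iteratively) can be sketched as follows. This is a minimal illustration rather than the google-research/maskgit implementation: `toy_predictor` and `iterative_decode` are hypothetical names, the random predictor stands in for the transformer, and the confidence-based commit rule with a cosine schedule is an assumption made for this sketch.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been committed yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer.

    Returns a (token, confidence) guess for every position at once.
    A real model would condition on the already committed `tokens`.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens=16, vocab_size=8, steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start with every position masked
    for step in range(1, steps + 1):
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        ratio = math.cos(math.pi / 2 * step / steps)
        keep_masked = max(0, math.floor(num_tokens * ratio))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident guesses; the rest are re-predicted next step.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        num_commit = max(0, len(masked) - keep_masked)
        for i in masked[:num_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_decode())
```

With the defaults, the schedule commits few tokens in the first step and progressively more in later steps, so all 16 positions are filled after 4 passes instead of 16 sequential ones.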