Text-to-Image Generation
287 papers with code • 11 benchmarks • 18 datasets
Text-to-Image Generation is a task in computer vision and natural language processing whose goal is to synthesize an image that matches a given textual description. This typically involves encoding the text input into a meaningful representation, such as a feature vector, and then decoding that representation into a corresponding image.
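As a concrete illustration, the open-source diffusers library exposes pretrained text-to-image pipelines behind a few lines of Python. The model checkpoint, prompt, and sampling parameters below are illustrative choices, not a recommendation:

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion pipeline (downloads weights on first use).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

# Encode the prompt and sample an image that matches it.
image = pipe(
    "a watercolor painting of a fox in a snowy forest",
    num_inference_steps=50,  # denoising steps: more steps, slower but finer
    guidance_scale=7.5,      # strength of the text conditioning
).images[0]
image.save("fox.png")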
Most implemented papers
MaskGIT: Masked Generative Image Transformer
At inference time, the model first generates all tokens of an image in parallel, then iteratively refines the image conditioned on the previous generation.
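A minimal sketch of this iterative parallel decoding, using a stand-in model and an assumed cosine masking schedule (the function, schedule, and hyperparameters are hypothetical illustrations, not the authors' released code):

import math
import torch

def maskgit_decode(model, seq_len=256, mask_id=1024, steps=8):
    # Start with every position masked.
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(1, steps + 1):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)              # (seq_len, vocab) predictions
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)      # confidence and argmax per position
        conf[~masked] = float("inf")        # already-fixed tokens stay fixed
        tokens = torch.where(masked, pred, tokens)  # tentatively fill every mask
        # Cosine schedule: how many positions remain masked after this step.
        n_mask = math.floor(seq_len * math.cos(math.pi / 2 * t / steps))
        if n_mask > 0:
            # Re-mask the n_mask least confident positions for the next round.
            remask = conf.topk(n_mask, largest=False).indices
            tokens[remask] = mask_id
    return tokens

dummy = lambda toks: torch.randn(toks.shape[0], 1024)  # stand-in for a real transformer
print(maskgit_decode(dummy)[:10])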
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions.
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis
If the initial image is poorly generated, the subsequent refinement stages can hardly improve it to a satisfactory quality.
CogView: Mastering Text-to-Image Generation via Transformers
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding.
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces
Recent advancements in the domain of text-to-image synthesis have culminated in a multitude of enhancements pertaining to quality, fidelity, and diversity.
Muse: Text-To-Image Generation via Masked Generative Transformers
Compared to pixel-space diffusion models such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models such as Parti, Muse is more efficient because of its parallel decoding.
Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction
Conditional text-to-image generation is an active area of research, with many possible applications.
ManiGAN: Text-Guided Image Manipulation
The goal of our paper is to semantically edit parts of an image matching a given text that describes desired attributes (e.g., texture, colour, and background), while preserving other contents that are irrelevant to the text.
Conditional Image Generation and Manipulation for User-Specified Content
User control over the generated content can be achieved by conditioning the model on additional information.
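As a bare-bones illustration of such conditioning, a generator can simply concatenate a text embedding with its noise input before decoding; the architecture and dimensions below are illustrative stand-ins, not the paper's model:

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    # Toy conditional generator: noise z and a text embedding are
    # concatenated and mapped to a flat 64x64 RGB image.
    def __init__(self, z_dim=128, text_dim=256, out_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),
        )

    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=-1))

g = ConditionalGenerator()
img = g(torch.randn(1, 128), torch.randn(1, 256))  # shape (1, 64*64*3)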