Zero-Shot Text-to-Image Generation
11 papers with code • 0 benchmarks • 0 datasets
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
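As a quick illustration of how such a contrastive representation is typically used, the sketch below scores one image against candidate captions by cosine similarity of the CLIP embeddings. It assumes the openai/CLIP package and a local file "photo.png"; both are illustrative choices, not part of the papers listed here.

```python
# Minimal sketch: ranking captions for an image with a pre-trained CLIP model.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP)
# and an example image "photo.png".
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.png")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a sketch of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Cosine similarity between normalized image and text embeddings.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # higher score = caption matches the image better
```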
Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding.
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs.
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity.
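The guidance technique most commonly paired with text-conditional diffusion models is classifier-free guidance, which mixes conditional and unconditional noise predictions at each sampling step. The sketch below shows that mixing step; `eps_model`, `text_emb`, and `null_emb` are hypothetical placeholders, not any specific paper's API.

```python
# Hedged sketch of classifier-free guidance: trade diversity for fidelity by
# extrapolating the conditional noise prediction away from the unconditional one.
import torch

def guided_noise_prediction(eps_model, x_t, t, text_emb, null_emb, guidance_scale=3.0):
    """Mix conditional and unconditional noise estimates.

    eps_model(x, t, cond) is assumed to return the predicted noise epsilon.
    guidance_scale > 1 pushes samples toward the text condition (higher fidelity,
    lower diversity); guidance_scale = 1 recovers the plain conditional prediction.
    """
    eps_cond = eps_model(x_t, t, text_emb)    # conditioned on the caption
    eps_uncond = eps_model(x_t, t, null_emb)  # conditioned on an empty caption
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```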
We approach text-to-image generation by combining the power of the pre-trained CLIP representation with an off-the-shelf image generator (a GAN), optimizing in the GAN's latent space to find images that achieve the maximum CLIP score for the given input text.
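A minimal sketch of that CLIP+GAN optimization loop is shown below: the generator and CLIP stay frozen, and only the latent code is updated to maximize CLIP similarity with the prompt. The names `generator`, `clip_model`, and `text_features` are assumptions for illustration, and CLIP's usual image normalization is omitted for brevity.

```python
# Hedged sketch: optimize a GAN latent code so the generated image maximizes
# CLIP similarity with a text prompt. Assumes a pre-trained generator mapping
# latents to images in [-1, 1] and a CLIP model, both loaded elsewhere in float32.
import torch
import torch.nn.functional as F

def optimize_latent(generator, clip_model, text_features, latent_dim=128,
                    steps=200, lr=0.05, device="cuda"):
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = generator(z)                    # (1, 3, H, W) generated image
        image = F.interpolate(image, size=224)  # CLIP's expected input resolution
        image_features = clip_model.encode_image(image)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        # Maximize cosine similarity with the (normalized) text embedding.
        loss = -(image_features * text_features).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```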
Unlike the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model into its diffusion process by designing a new initialization distribution and a new transition step for the diffusion.
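The sketch below only illustrates the initialization idea under an assumed formulation, not the paper's actual method: instead of starting the reverse process from a zero-mean Gaussian, the initial sample is drawn from a Gaussian centered at the normalized CLIP text embedding.

```python
# Illustrative sketch (assumed formulation): a CLIP-informed initialization for
# the reverse diffusion process. `clip_text_embedding` and `sigma` are
# hypothetical placeholders.
import torch

def shifted_initialization(clip_text_embedding, sigma=1.0):
    """Sample x_T ~ N(mu, sigma^2 I), with mu the normalized CLIP text embedding."""
    mu = clip_text_embedding / clip_text_embedding.norm(dim=-1, keepdim=True)
    return mu + sigma * torch.randn_like(mu)
```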