Text-to-image (T2I) personalization allows users to guide the creative image generation process by incorporating their own visual concepts into natural language prompts.
The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size.
Specifically, we employ two components: First, an encoder that takes as input a single image of a target concept from a given domain, e.g., a specific face, and learns to map it into a word embedding representing the concept.
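As a minimal sketch of this idea (the module structure, dimensions, and names below are illustrative assumptions, not the paper's actual architecture), such an encoder can be a small network that maps an image of the concept to a vector with the same dimensionality as the text encoder's token embeddings:

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Hypothetical sketch: maps a single concept image to a pseudo word embedding.

    A stand-in visual backbone followed by a small projection head into the
    token-embedding space of a text encoder.
    """

    def __init__(self, feature_dim: int = 512, token_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained image backbone (e.g., a CLIP image encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feature_dim),
        )
        # Projection head that outputs the concept's word embedding.
        self.to_token = nn.Sequential(
            nn.Linear(feature_dim, token_dim),
            nn.LayerNorm(token_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> (batch, token_dim)
        return self.to_token(self.backbone(image))

# Usage: one image of the target concept -> one word embedding.
encoder = ConceptEncoder()
concept_embedding = encoder(torch.randn(1, 3, 224, 224))  # shape (1, 768)
```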
Diffusion models have enabled high-quality, conditional image editing capabilities.
Yet, it is unclear how these capabilities can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
We propose an architecture for solving personalized vision-and-language (PerVL) tasks that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.
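A minimal sketch of vocabulary extension, using a toy vocabulary and a plain PyTorch embedding table as stand-ins for a real pretrained text model (the token name, sizes, and helper function are hypothetical):

```python
import torch
import torch.nn as nn

# Toy stand-ins for a pretrained text model's vocabulary and embedding table.
vocab = {"a": 0, "photo": 1, "of": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

def add_concept(token: str, vocab: dict, embedding: nn.Embedding) -> nn.Embedding:
    """Append one trainable row to the embedding table for a new concept token."""
    vocab[token] = len(vocab)
    new_row = torch.randn(1, embedding.embedding_dim) * 0.01  # fresh initialization
    extended = nn.Embedding(len(vocab), embedding.embedding_dim)
    with torch.no_grad():
        extended.weight[:-1] = embedding.weight  # copy the pretrained rows unchanged
        extended.weight[-1:] = new_row           # only this row is meant to be tuned
    return extended

embedding = add_concept("<my-unicorn>", vocab, embedding)
ids = torch.tensor([[vocab["a"], vocab["photo"], vocab["of"], vocab["<my-unicorn>"]]])
prompt_embeddings = embedding(ids)  # (1, 4, 768), now including the personalized concept
```

In practice the pretrained rows would be kept frozen (e.g., by masking their gradients) so that only the new concept embedding is optimized.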
Among GAN architectures, StyleGAN offers a fascinating case study, owing to its remarkable visual quality and its ability to support a large array of downstream tasks.
We compare our models to a wide range of latent editing methods, and show that by alleviating the bias, our models achieve finer semantic control and better identity preservation across a wider range of transformations.
The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely exploited for facial image editing.
Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image?
For modern generative frameworks, this semantic encoding manifests as smooth, linear directions which affect image attributes in a disentangled manner.
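A minimal sketch of such a linear latent edit, assuming a StyleGAN-style latent code and a learned attribute direction (both random stand-ins here; the function name and edit strength are illustrative):

```python
import torch

def apply_latent_direction(w: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linear latent edit: move a latent code along a unit-normalized semantic direction.

    w: latent code(s), shape (batch, dim); direction: shape (dim,);
    alpha: edit strength (its sign flips the attribute, its magnitude scales it).
    """
    direction = direction / direction.norm()
    return w + alpha * direction

# Illustrative usage with random stand-ins for a W-space code and a "smile" direction.
w = torch.randn(1, 512)
smile_direction = torch.randn(512)
w_edited = apply_latent_direction(w, smile_direction, alpha=3.0)
```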
In recent years, considerable progress has been made in the visual quality of Generative Adversarial Networks (GANs).
Our network encourages disentangled generation of semantic parts via two key ingredients: a root-mixing training strategy which helps decorrelate the different branches to facilitate disentanglement, and a set of loss terms designed with part disentanglement and shape semantics in mind.