Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts.
The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size.
Specifically, we employ two components: First, an encoder that takes as input a single image of a target concept from a given domain, e.g., a specific face, and learns to map it into a word embedding representing the concept.
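Such an encoder can be sketched as a learned projection from image features into the text-embedding space. The sketch below is a minimal illustration, not the paper's architecture: the dimensions, the single linear layer, and the function names are all assumptions (real encoders are deep networks trained end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: image feature dimension and word-embedding dimension.
IMG_DIM, EMB_DIM = 512, 768

# A minimal stand-in "encoder": one linear projection from image
# features to the text-embedding space.
W = rng.standard_normal((EMB_DIM, IMG_DIM)) * 0.02

def encode_concept(image_features: np.ndarray) -> np.ndarray:
    """Map features of a single concept image to a word embedding."""
    return W @ image_features

image_features = rng.standard_normal(IMG_DIM)
concept_embedding = encode_concept(image_features)
print(concept_embedding.shape)  # (768,)
```

The resulting vector lives in the same space as the model's word embeddings, so it can be substituted for a token in a prompt.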
Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.
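Extending the input vocabulary amounts to appending new rows to the pretrained embedding table, one per personalized concept token. This is a hedged sketch under assumed names and shapes (the toy vocabulary, table size, and `add_concept` helper are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pretrained embedding table: vocab_size x emb_dim.
vocab = {"a": 0, "photo": 1, "of": 2}
emb_table = rng.standard_normal((len(vocab), 8))

def add_concept(token: str, table: np.ndarray, vocab: dict) -> np.ndarray:
    """Append a freshly initialized embedding row for a new concept token;
    only this row would be optimized during personalization."""
    vocab[token] = table.shape[0]
    new_row = rng.standard_normal((1, table.shape[1])) * 0.01
    return np.vstack([table, new_row])

emb_table = add_concept("<my-dog>", emb_table, vocab)
print(vocab["<my-dog>"], emb_table.shape)  # 3 (4, 8)
```

Because only the new rows are trainable, the pretrained model's weights, and hence its size, stay unchanged.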
Ranked #7 on Zero-Shot Composed Image Retrieval (ZS-CIR) on CIRCO
Reasoning and interacting with dynamic environments is a fundamental problem in AI, but it becomes extremely challenging when actions can trigger cascades of cross-dependent events.
Specifically, given bird images with free-text descriptions of their species, we learn to classify images of previously unseen species based on their species descriptions.
Real-world data is predominantly unbalanced and long-tailed, but deep models struggle to recognize rare classes in the presence of frequent classes.
Ranked #1 on Long-tail learning with class descriptors on CUB-LT
Here we describe a new approach to learn with fewer samples, by using additional information that is provided per sample.
Specifically, our model consists of three classifiers: a "gating" model that makes a soft decision on whether a sample is from a "seen" class, and two experts: a ZSL expert and an expert model for seen classes.
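The gating scheme can be illustrated as a soft mixture: the gate's seen-class probability weights the seen expert's distribution, and the remainder weights the ZSL expert's. The toy linear gate and experts below are assumptions for illustration only, not the paper's trained models.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins for the three learned models (hypothetical shapes).
w_gate = rng.standard_normal(16)
W_seen = rng.standard_normal((5, 16))  # 5 seen classes
W_zsl = rng.standard_normal((3, 16))   # 3 unseen classes

def gate(x):
    """Soft probability that x belongs to a seen class."""
    return 1.0 / (1.0 + np.exp(-w_gate @ x))

def seen_expert(x):
    return softmax(W_seen @ x)

def zsl_expert(x):
    return softmax(W_zsl @ x)

def mixture_predict(x):
    """Combine experts: gate weight on seen classes, rest on unseen."""
    p_seen = gate(x)
    return np.concatenate([p_seen * seen_expert(x),
                           (1 - p_seen) * zsl_expert(x)])

x = rng.standard_normal(16)
probs = mixture_predict(x)
print(probs.shape)  # (8,); entries sum to 1
```

Because the gate is soft, a single forward pass yields one normalized distribution over seen and unseen classes jointly.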
The soft group structure can be learned from data jointly as part of the model, and can also readily incorporate prior knowledge about groups if available.
Recurrent neural networks have recently been used for learning to describe images using natural language.