Another line of techniques expand the inversion space to learn multiple embeddings but they do this only along the layer dimension (e. g., one per layer of the DDPM model) or the timestep dimension (one for a set of timesteps in the denoising process), leading to suboptimal attribute disentanglement.
We hope our work would attract attention to this newly identified, pragmatic problem setting.
Recent works in self-supervised learning have shown impressive results on single-object images, but they struggle to perform well on complex multi-object images as evidenced by their poor visual grounding.
First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among various concepts and the eventual capture of all concepts in the generated output.
In this paper, we propose a novel task - MIMOQA - Multimodal Input Multimodal Output Question Answering in which the output is also multimodal.