We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions.
Language-guided image editing has recently achieved remarkable success.
With a semantic feature matching loss providing effective semantic supervision, our sketch embedding precisely conveys the semantics of the input sketches to the synthesized images.
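A semantic feature matching loss of this kind is typically computed as a distance between feature maps extracted from a pretrained semantic encoder. The minimal sketch below illustrates the general idea with a mean-L1 distance over several layers; the function name, the use of L1, and the layer averaging are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between corresponding encoder feature maps.

    feats_real / feats_fake: lists of np.ndarray feature maps, assumed to be
    the outputs of several layers of a pretrained semantic encoder applied
    to the reference photo and the synthesized image (hypothetical inputs).
    """
    assert len(feats_real) == len(feats_fake)
    per_layer = [np.mean(np.abs(fr - ff))          # L1 distance per layer
                 for fr, ff in zip(feats_real, feats_fake)]
    return float(np.mean(per_layer))               # average over layers

# Toy example with random stand-in "feature maps".
rng = np.random.default_rng(0)
f_real = [rng.standard_normal((8, 4, 4)) for _ in range(3)]
f_fake = [f + 0.1 for f in f_real]                 # fake features shifted by 0.1
print(feature_matching_loss(f_real, f_fake))       # ~0.1 for a constant shift
```

In practice the loss is differentiated with respect to the generator's parameters, so it pushes the synthesized image's features toward those of the reference, supervising semantics rather than raw pixels.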
In contrast to the traditional avatar creation pipeline, which is costly, contemporary generative approaches learn the data distribution directly from photographs.
In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches.