The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term.
In this paper, we propose a diffusion-based face swapping framework for the first time, called DiffFace, composed of training ID conditional DDPM, sampling with facial guidance, and a target-preserving blending.
Existing pipelines of semantic correspondence commonly include extracting high-level semantic features for the invariance against intra-class variations and background clutters.
However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias.
Ranked #1 on Semantic correspondence on PF-WILLOW
We introduce a novel cost aggregation network, dubbed Volumetric Aggregation with Transformers (VAT), to tackle the few-shot segmentation task by using both convolutions and transformers to efficiently handle high dimensional correlation maps between query and support.
Ranked #2 on Semantic correspondence on PF-WILLOW