
Content-Conditioned Style Encoder

Introduced by Saito et al. in COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder

The Content-Conditioned Style Encoder, or COCO, is a style encoder used for image-to-image translation in the COCO-FUNIT architecture. Unlike the style encoder in FUNIT, COCO takes both the content and style images as input. This content conditioning scheme creates a direct feedback path during learning that lets the content image influence how the style code is computed. It also reduces the direct influence of the style image on the extracted style code.

The architecture operates as follows. First, the content image is fed into an encoder $E_{S, C}$ to compute a spatial feature map. This content feature map is then mean-pooled and mapped to a vector $\zeta_{c}$. Similarly, the style image is fed into an encoder $E_{S, S}$ to compute a spatial feature map. The style feature map is then mean-pooled and concatenated with an input-independent bias vector: the constant style bias (CSB). Note that while a regular bias in deep networks is added to the activations, the CSB is concatenated with the activations. The CSB provides a fixed input to the style encoder, which helps compute a style code that is less sensitive to variations in the style image.
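To make the concatenation point concrete, here is a minimal PyTorch sketch contrasting a regular additive bias with the concatenated CSB; the dimensions and variable names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

feat_dim, batch = 256, 4                     # illustrative sizes
pooled_style = torch.randn(batch, feat_dim)  # mean-pooled style features
csb = nn.Parameter(torch.zeros(feat_dim))    # learned constant style bias

# A regular bias is added to the activations (dimensionality unchanged):
added = pooled_style + csb                   # shape: (batch, feat_dim)

# The CSB is instead concatenated with the activations, doubling the
# feature dimension; a fully connected layer later maps this to zeta_s.
concatenated = torch.cat([pooled_style, csb.expand(batch, -1)], dim=1)
assert concatenated.shape == (batch, 2 * feat_dim)
```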

The concatenation of the pooled style vector and the CSB is mapped to a vector $\zeta_{s}$ via a fully connected layer. We then take the element-wise product of $\zeta_{c}$ and $\zeta_{s}$ to obtain the final style code. The style code is then mapped to the AdaIN parameters used to generate the translation. Through this element-wise product, the resulting style code is heavily influenced by the content image; one way to look at this mechanism is that it produces a style code customized for the input content image.
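Putting these steps together, a minimal PyTorch sketch of the encoder could look as follows. The class name, the two-layer convolutional backbones, and all layer sizes are assumptions for illustration; the actual COCO-FUNIT configuration is in the paper and its official code.

```python
import torch
import torch.nn as nn

class COCOStyleEncoder(nn.Module):
    """Sketch of the content-conditioned style encoder (the mapping phi).

    The conv stacks below are placeholders, not the paper's exact layers.
    """

    def __init__(self, feat_dim=256, style_dim=64):
        super().__init__()
        def backbone():  # shared structure for E_{S,C} and E_{S,S}
            return nn.Sequential(
                nn.Conv2d(3, feat_dim, 7, stride=2, padding=3), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.E_sc = backbone()                          # content branch E_{S,C}
        self.E_ss = backbone()                          # style branch E_{S,S}
        self.csb = nn.Parameter(torch.zeros(feat_dim))  # constant style bias
        self.to_zeta_c = nn.Linear(feat_dim, style_dim)
        self.to_zeta_s = nn.Linear(2 * feat_dim, style_dim)

    def forward(self, x_c, x_s):
        # Content path: spatial features -> mean pool -> zeta_c
        zeta_c = self.to_zeta_c(self.E_sc(x_c).mean(dim=(2, 3)))
        # Style path: spatial features -> mean pool -> concat CSB -> zeta_s
        pooled_s = self.E_ss(x_s).mean(dim=(2, 3))
        csb = self.csb.expand(pooled_s.size(0), -1)
        zeta_s = self.to_zeta_s(torch.cat([pooled_s, csb], dim=1))
        # Element-wise product gives the content-customized style code z_s
        return zeta_c * zeta_s
```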

The COCO is used as a drop-in replacement for the style encoder in FUNIT. Let $\phi$ denote the COCO mapping. The translation output is then computed via

$$ z_{c}=E_{C}\left(x_{c}\right), \quad z_{s}=\phi\left(E_{S, S}\left(x_{s}\right), E_{S, C}\left(x_{c}\right)\right), \quad \bar{x}=F\left(z_{c}, z_{s}\right) $$

The style code extracted by the COCO is more robust to variations in the style image. Note that we set $E_{S, C} \equiv E_{C}$ to keep the number of parameters in our model similar to that in FUNIT.
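For context on how the style code is consumed downstream, the sketch below shows one AdaIN step inside a decoder like $F$: a linear layer (here the assumed `to_adain`) maps $z_{s}$ to per-channel scale and shift parameters, which modulate normalized content features. All dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf  # aliased to avoid clashing with the decoder F

def adain(feat, gamma, beta, eps=1e-5):
    # Normalize per-channel statistics, then re-scale and shift with
    # the style-derived parameters.
    normalized = nnf.instance_norm(feat, eps=eps)
    return gamma[..., None, None] * normalized + beta[..., None, None]

style_dim, channels = 64, 256                  # illustrative sizes
z_s = torch.randn(1, style_dim)                # style code from COCO (phi)
z_c = torch.randn(1, channels, 16, 16)         # content code from E_C

to_adain = nn.Linear(style_dim, 2 * channels)  # maps z_s to (gamma, beta)
gamma, beta = to_adain(z_s).chunk(2, dim=1)
decoded = adain(z_c, gamma, beta)              # one AdaIN step inside F
```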

Source: COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder
