A coarse-to-fine approach for dynamic-to-static image translation
Dynamic-to-static image translation aims to convert a dynamic scene into a static one, eliminating dynamic elements from the image. Recent works typically treat the problem as an image-to-image translation task and apply a learned feature mapping over the whole dynamic image to synthesize the static image, which causes unnecessary detail loss in originally static regions. To address this, we formulate the task as an image-inpainting-like problem: fill in the missing static pixels in dynamic regions while retaining the original static regions. We achieve this with a coarse-to-fine framework. At the coarse stage, we use a simple encoder-decoder network to rough out the static image. Using the coarse prediction, we explicitly infer a more accurate dynamic mask that identifies both dynamic objects and their shadows, so that the task can be effectively converted into an image inpainting problem. At the fine stage, we recover the missing static pixels in the estimated dynamic regions, building on their coarse predictions. We enhance the coarse predicted content with a proposed mutual texture-structure attention module, which lets the dynamic regions borrow textures and structures separately from distant locations based on contextual similarity. Several losses are combined into the training objective to produce results with global consistency and fine details. Qualitative and quantitative experiments show that our method outperforms state-of-the-art models in restoring high-quality static content. In addition, we evaluate the usefulness of the recovered static images by using them as query images to improve visual place recognition in dynamic scenes.
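To make the pipeline concrete, the following is a minimal PyTorch sketch of the coarse-to-fine idea described above. It is an illustration only: the network architecture, channel sizes, the photometric mask-thresholding heuristic, and the simplified single-branch attention (standing in for the paper's mutual texture-structure attention module) are all assumptions, not the authors' implementation.

```python
# Speculative sketch of the coarse-to-fine dynamic-to-static pipeline.
# All module names, shapes, and the mask heuristic are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseNet(nn.Module):
    """Simple encoder-decoder that roughs out a static image from a dynamic one."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

class PatchAttention(nn.Module):
    """Toy stand-in for the mutual texture-structure attention: features in
    masked (dynamic) regions borrow content from all locations by similarity."""
    def forward(self, feat, mask):
        b, c, h, w = feat.shape
        q = F.normalize(feat.flatten(2), dim=1)                       # (B, C, HW)
        attn = torch.softmax(q.transpose(1, 2) @ q * 10.0, dim=-1)    # (B, HW, HW)
        borrowed = (attn @ feat.flatten(2).transpose(1, 2)).transpose(1, 2).view(b, c, h, w)
        m = F.interpolate(mask, size=(h, w), mode="nearest")
        return feat * (1 - m) + borrowed * m                          # update masked features only

def infer_dynamic_mask(dynamic_img, coarse_static, thresh=0.1):
    """Assumed heuristic: large photometric change between the input and the
    coarse prediction marks dynamic objects and their shadows."""
    diff = (dynamic_img - coarse_static).abs().mean(dim=1, keepdim=True)
    return (diff > thresh).float()

# Hypothetical end-to-end usage: coarse prediction, mask inference, fine inpainting.
coarse_net, attn = CoarseNet(), PatchAttention()
dynamic = torch.rand(1, 3, 64, 64)
coarse = coarse_net(dynamic)
mask = infer_dynamic_mask(dynamic, coarse)
feat = coarse_net.enc(coarse * mask + dynamic * (1 - mask))
fine = coarse_net.dec(attn(feat, mask))
static_out = dynamic * (1 - mask) + fine * mask  # retain original static regions
```

The last line reflects the key difference from whole-image translation stressed in the abstract: pixels outside the inferred dynamic mask are copied from the input unchanged, so only the dynamic regions are synthesized.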