Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network.
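A minimal sketch of what "differentiably placed" can mean in practice: each blob contributes its feature vector to every grid cell, weighted by a Gaussian falloff from its center, so gradients flow back to blob positions and features. All shapes and names below are illustrative assumptions, not the paper's actual API.

    import torch

    def splat_blobs(centers, features, grid_size=32, sigma=0.1):
        # centers: (B, K, 2) blob centers in [0, 1]; features: (B, K, C)
        ys = torch.linspace(0, 1, grid_size)
        xs = torch.linspace(0, 1, grid_size)
        gy, gx = torch.meshgrid(ys, xs, indexing='ij')          # (G, G)
        grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)  # (1, 1, G*G, 2)
        d2 = ((grid - centers.unsqueeze(2)) ** 2).sum(-1)       # (B, K, G*G)
        w = torch.exp(-d2 / (2 * sigma ** 2))                   # soft, differentiable weights
        feat = torch.einsum('bkg,bkc->bgc', w, features)        # weighted feature sum
        B, _, C = features.shape
        return feat.view(B, grid_size, grid_size, C)            # feature grid for the decoder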
We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions.
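One way to read "samples patches at random scales": draw a scale factor, crop a correspondingly sized region, and resample it to a fixed patch resolution so the network sees the scene at varying levels of detail. A hedged PyTorch sketch, with all parameter names assumed:

    import torch
    import torch.nn.functional as F

    def sample_patch_at_random_scale(img, patch_size=64, min_scale=0.25, max_scale=1.0):
        # img: (N, C, H, W); a larger crop downsampled to patch_size
        # corresponds to viewing the image at a coarser scale
        _, _, h, w = img.shape
        scale = torch.empty(1).uniform_(min_scale, max_scale).item()
        crop = min(int(patch_size / scale), h, w)
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        region = img[:, :, top:top + crop, left:left + crop]
        patch = F.interpolate(region, size=(patch_size, patch_size),
                              mode='bilinear', align_corners=False)
        return patch, scale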
We propose Neural Neighbor Style Transfer (NNST), a pipeline that offers state-of-the-art quality, generalization, and competitive efficiency for artistic style transfer.
Recent image inpainting methods have made great progress but often struggle to generate plausible image structures when dealing with large holes in complex images.
Instead of modeling this complex domain with a single GAN, we propose a novel method to combine multiple pretrained GANs, where one GAN generates a global canvas (e.g., a human body) and a set of specialized GANs, or insets, focus on different parts (e.g., faces, shoes) that can be seamlessly inserted onto the global canvas.
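Conceptually, the final step reduces to blending an inset GAN's output into a region of the global canvas. A minimal compositing sketch, assuming an axis-aligned box and a soft alpha mask (both hypothetical; the paper's actual insertion may be learned):

    def paste_inset(canvas, inset, box, alpha):
        # canvas, inset: (N, C, H, W) tensors; alpha: soft mask in [0, 1]
        # matching the inset's spatial size; box: (y0, x0, y1, x1)
        y0, x0, y1, x1 = box
        out = canvas.clone()
        region = canvas[:, :, y0:y1, x0:x1]
        out[:, :, y0:y1, x0:x1] = alpha * inset + (1 - alpha) * region
        return out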
In particular, we demonstrate that while StyleGAN3 can be trained on unaligned data, one can still use aligned data for training, without hindering the ability to generate unaligned imagery.
Reference-guided image inpainting restores image pixels by leveraging the content from another reference image.
We introduce a high-resolution, 3D-consistent image and shape generation technique which we call StyleSDF.
Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training?
We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly end-to-end.
Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation.
We present an approach to example-based stylization of images that uses a single pair of a source image and its stylized counterpart.
Dozens of saliency models have been designed over the last few decades, targeted at diverse applications ranging from image compression and retargeting to robot navigation, surveillance, and distractor detection.
We present an algorithm for re-rendering a person from a single image under arbitrary poses.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting.
Our approach produces generalizable functional representations of images, videos and shapes, and achieves higher reconstruction quality than prior works that are optimized for a single signal.
Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images.
Image inpainting is the task of plausibly restoring missing pixels within a hole region whose contents have been removed from a target image.
Our model generates novel poses based on keypoint locations, which can be modified in real time while providing interactive feedback, allowing for intuitive reposing and animation.
We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation.
Manipulation of visual attributes via these StyleSpace controls is shown to be better disentangled than via those proposed in previous works.
Extensions of our model allow for multi-style edits and the ability to both increase and attenuate attention in an image region.
Deep generative models have become increasingly effective at producing realistic images from randomly sampled seeds, but using such models for controllable manipulation of existing images remains challenging.
To address this challenge, we propose an iterative inpainting method with a feedback mechanism.
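The feedback loop itself can be as simple as re-feeding the partial result: known pixels are kept, and the model's prediction inside the hole becomes the input to the next pass. A hedged sketch, where `model(image, mask)` returning a full completed frame is an assumption:

    def iterative_inpaint(model, image, mask, steps=4):
        # mask is 1 inside the hole, 0 on known pixels (an assumed convention)
        result = image
        for _ in range(steps):
            completed = model(result, mask)
            # keep known content, accept the prediction inside the hole,
            # then feed the refined result back in as feedback
            result = image * (1 - mask) + completed * mask
        return result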
In image morphing, a sequence of plausible frames is synthesized and composited together to form a smooth transformation between given instances.
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training.
We present a method to improve the visual realism of low-quality, synthetic images, e.g., OpenGL renderings.
Most existing aging methods are limited to changing the texture, overlooking transformations in head shape that occur during the human aging and growth process.
We capture these subtle changes by applying an image translation network to refine the mesh rendering, providing an end-to-end model to generate new animations of a character with high visual quality.
We propose an interactive GAN-based sketch-to-image translation method that helps novice users create images of simple objects.
We introduce UprightNet, a learning-based approach for estimating 2DoF camera orientation from a single RGB image of an indoor scene.
To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material.
In this paper, we address the problem of 3D object mesh reconstruction from RGB videos.
To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.
Our model jointly learns a feature embedding for motion modes (from which the motion sequence can be reconstructed) and a feature transformation that represents the transition from one motion mode to the next.
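In code, the two jointly learned pieces might look like an autoencoder over motion snippets plus a transition map in embedding space; the linear layers below are placeholders for illustration, not the paper's architecture:

    import torch.nn as nn

    class MotionModeModel(nn.Module):
        def __init__(self, motion_dim, embed_dim):
            super().__init__()
            self.encode = nn.Linear(motion_dim, embed_dim)
            self.decode = nn.Linear(embed_dim, motion_dim)     # reconstruction head
            self.transition = nn.Linear(embed_dim, embed_dim)  # mode -> next mode

        def forward(self, motion):
            e = self.encode(motion)
            # return the reconstructed snippet and the predicted next-mode embedding
            return self.decode(e), self.transition(e)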
Our method takes the original unprocessed and per-frame processed videos as inputs to produce a temporally consistent video.
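A common way to formalize "temporally consistent" (an assumption here, not necessarily the paper's exact loss) is to penalize each output frame against the previous output warped by optical flow, ignoring occluded pixels:

    def temporal_consistency_loss(cur_frame, warped_prev_frame, visibility_mask):
        # all arguments are image tensors of the same shape; visibility_mask
        # zeros out pixels that are occluded or leave the frame between steps
        return (visibility_mask * (cur_frame - warped_prev_frame).abs()).mean()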
We address the problem of finding realistic geometric corrections to a foreground object such that it appears natural when composited into a background image.
We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics.
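The core recipe behind using deep features as a metric can be sketched in a few lines: run both images through a fixed pretrained network, unit-normalize activations per channel, and average the squared differences. This is an LPIPS-like illustration without the learned per-channel weights, assuming a recent torchvision:

    import torch
    import torchvision.models as models

    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

    def deep_feature_distance(x, y, layer=8):
        # x, y: ImageNet-normalized image batches (an assumption)
        with torch.no_grad():
            fx, fy = x, y
            for i, m in enumerate(vgg):
                fx, fy = m(fx), m(fy)
                if i == layer:
                    break
        fx = fx / (fx.norm(dim=1, keepdim=True) + 1e-10)   # unit-normalize channels
        fy = fy / (fy.norm(dim=1, keepdim=True) + 1e-10)
        return ((fx - fy) ** 2).sum(dim=1).mean(dim=(1, 2))  # one score per image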
In this work, we focus on the challenge of taking partial observations of highly-stylized text and generalizing the observations to generate unobserved glyphs in the ornamented typeface.
Our proposed method encourages bijective consistency between the latent encoding and output modes.
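One half of such a bijection is typically enforced with a latent-regression term: encode the generated output back to latent space and penalize the mismatch with the code that produced it. A hedged sketch (G, E, and x_cond are assumed modules and inputs, not the paper's exact ones):

    import torch.nn.functional as F

    def latent_regression_loss(G, E, x_cond, z):
        y = G(x_cond, z)        # output mode produced from latent code z
        z_hat = E(y)            # encode the output back to latent space
        return F.l1_loss(z_hat, z)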
In many computer vision tasks, for example saliency prediction or semantic segmentation, the desired output is a foreground map that predicts pixels where some criterion is satisfied.
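For such foreground maps, a region-level objective like a soft IoU is a natural fit; the loss below is an illustrative example, not necessarily the criterion the paper uses:

    def soft_iou_loss(pred, target, eps=1e-6):
        # pred: predicted foreground probabilities in [0, 1], (N, 1, H, W)
        # target: binary ground-truth foreground map of the same shape
        inter = (pred * target).sum(dim=(1, 2, 3))
        union = (pred + target - pred * target).sum(dim=(1, 2, 3))
        return (1 - inter / (union + eps)).mean()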
A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.
Traditional face editing methods often require a number of sophisticated and task-specific algorithms to be applied one after the other, a process that is tedious, fragile, and computationally intensive.
This paper introduces a deep-learning approach to photographic style transfer that handles a large variety of image content while faithfully transferring the reference style.
Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context-aware details, impacting fundamental image manipulation tasks such as object removal.
Neural Style Transfer has shown very exciting results enabling new forms of image manipulation.
Realistic image manipulation is challenging because it requires modifying the image appearance in a user-controlled way, while preserving the realism of the result.
In this work we propose a fully automatic shadow region harmonization approach that improves the appearance compatibility of the de-shadowed region as typically produced by previous methods.
In this work, we investigate the problem of automatically inferring the lattice structure of near-regular textures (NRT) in real-world images.
As font is one of the core design concepts, automatic font identification and similar-font suggestion from an image or photo have been on the wish list of many designers.
We address a challenging fine-grained classification problem: recognizing a font style from an image of text.
We present a domain adaptation framework to address a domain mismatch between synthetic training and real-world testing data.
This paper addresses the large-scale visual font recognition (VFR) problem, which aims at automatic identification of the typeface, weight, and slope of the text in an image or photo without any knowledge of content.
For example, each video frame is observed for only a fraction of a second, while a still image can be viewed leisurely.
In this work we propose a crowdsourced method for acquisition of gaze direction data from a virtually unlimited number of participants, using a robust self-reporting mechanism (see Figure 1).