We mitigate this by descending the deep layers of a pre-trained network, where the deep features contain more semantics, and applying the translation between these deep features.
We introduce an unsupervised technique for encoding point clouds into a canonical shape representation, by disentangling shape and pose.
Truncation is widely used in generative models for improving the quality of the generated samples, at the expense of reducing their diversity.
In particular, we demonstrate that while StyleGAN3 can be trained on unaligned data, one can still use aligned data for training, without hindering the ability to generate unaligned imagery.
We present ShapeFormer, a transformer-based network that produces a distribution of object completions, conditioned on incomplete, and possibly noisy, point clouds.
Chart question answering (CQA) is a task used for assessing chart comprehension, which is fundamentally different from understanding natural images.
Ranked #1 on Visual Question Answering on PlotQA-D1
Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation.
The reason is that the learnt weights for balancing the importance between the shape and base components in ShapeConv become constants in the inference phase, and thus can be fused into the following convolution, resulting in a network that is identical to one with vanilla convolutional layers.
Ranked #3 on Semantic Segmentation on Stanford2D3D
In the second stage, we merge the rooted models by averaging their weights and fine-tuning them for each specific domain, using only data generated by the original trained models.
Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images.
Ranked #1 on Image Manipulation on 10-Monty-Hall (using extra training data)
Edge-preserving filters play an essential role in some of the most basic tasks of computational photography, such as abstraction, tonemapping, detail enhancement and texture removal, to name a few.
Manipulation of visual attributes via these StyleSpace controls is shown to be better disentangled than via those proposed in previous works.
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video. While previous methods rely on either rigging or inverse kinematics (IK) to associate a consistent skeleton with temporally coherent joint rotations, our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used, motion representation.
Moreover, in the inference phase, the depthwise convolution is folded into the conventional convolution, reducing the computation to be exactly equivalent to that of a convolutional layer without over-parameterization.
In this paper, we present a novel data-driven framework for motion style transfer, which learns from an unpaired collection of motions with style labels, and enables transferring motion styles not observed during training.
In other words, our operators form the building blocks of a new deep motion processing framework that embeds the motion into a common latent space, shared by a collection of homeomorphic skeletons.
The emergence of deep generative models has recently enabled the automatic generation of massive amounts of graphical content, both in 2D and in 3D.
Reflections are very common phenomena in our daily photography, which distract people's attention from the scene behind the glass.
In this paper, we present a new, physically-based, approach for estimating illuminant chromaticity from interreflections of light between diffuse surfaces.
Our translation is performed in a cascaded, deep-to-shallow, fashion, along the deep feature hierarchy: we first translate between the deepest layers that encode the higher-level semantic content of the image, proceeding to translate the shallower layers, conditioned on the deeper ones.
In order to achieve our goal, we learn to extract, directly from a video, a high-level latent motion representation, which is invariant to the skeleton geometry and the camera view.
Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation.
Accurate semantic image segmentation requires the joint consideration of local appearance, semantic information, and global scene context.
After training a deep generative network using a reference video capturing the appearance and dynamics of a target actor, we are able to generate videos where this actor reenacts other performances.
The key idea is that during the analysis, the two branches exchange information between them, thereby learning the dependencies between structure and geometry and encoding two augmented features, which are then fused into a single latent code.
The key idea is that by learning to separately extract both the common and the domain-specific features, one can synthesize more target domain data with supervision, thereby boosting the domain adaptation performance.
We demonstrate that this conceptually simple approach is highly effective for capturing large-scale structures, as well as other non-stationary attributes of the input exemplar.
Correspondence between images is a fundamental problem in computer vision, with a variety of graphics applications.
Contextual information provides important cues for disambiguating visually similar pixels in scene segmentation.
We show that the resulting P-maps may be used to evaluate how likely a rectangle proposal is to contain an instance of the class, and further process good proposals to produce an accurate object cutout mask.
Human 3D pose estimation from a single image is a challenging task with numerous applications.