We mitigate this by descending into the deeper layers of a pre-trained network, where the features carry more semantic content, and applying the translation between these deep features.
Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction.
Our formulation also accounts for the correlation that exists between the condition image and the samples along the modified diffusion process.
We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts, along with a 3D ROI box.
These questions form a dialog with the user in order to retrieve the desired image from a large corpus.
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts.
We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expression ability.
Due to the lack of large-scale datasets with detailed textual descriptions for each region of an image, we leverage existing large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, demonstrating its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space.
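As a rough illustration of why latent-space diffusion is faster, the sketch below runs the reverse process on a small latent tensor rather than a full-resolution image. The denoise_step and decode functions are hypothetical stand-ins, not the actual LDM components, which are learned networks.

    import torch

    def denoise_step(z, t):
        return 0.99 * z            # placeholder for one reverse-diffusion step

    def decode(z):
        return z                   # placeholder for the autoencoder's decoder

    def sample(steps=50):
        # The reverse process runs on a 4x64x64 latent instead of a
        # 3x512x512 image, so each step touches roughly 48x fewer values.
        z = torch.randn(1, 4, 64, 64)
        for t in reversed(range(steps)):
            z = denoise_step(z, t)
        return decode(z)           # map the final latent back to pixel space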
We introduce an unsupervised technique for encoding point clouds into a canonical shape representation, by disentangling shape and pose.
Truncation is widely used in generative models for improving the quality of the generated samples, at the expense of reducing their diversity.
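For context, here is a minimal sketch of the standard truncation trick as popularized by StyleGAN; the w_avg and psi names follow that convention, and this illustrates the general technique rather than this paper's exact procedure.

    import torch

    def truncate(w, w_avg, psi=0.7):
        # psi = 1 leaves samples untouched (full diversity); psi = 0
        # collapses them onto the average latent (high quality, no diversity).
        return w_avg + psi * (w - w_avg)

    w_avg = torch.zeros(512)       # running mean of mapped latents
    w = torch.randn(8, 512)        # a batch of latent codes
    w_truncated = truncate(w, w_avg)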
In particular, we demonstrate that while StyleGAN3 can be trained on unaligned data, one can still use aligned data for training, without hindering the ability to generate unaligned imagery.
We present ShapeFormer, a transformer-based network that produces a distribution of object completions, conditioned on incomplete, and possibly noisy, point clouds.
Our model is particularly well suited for realistic questions with out-of-vocabulary answers that require regression.
Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation.
The reason is that the learned weights that balance the relative importance of the shape and base components in ShapeConv become constants in the inference phase, and can thus be fused into the following convolution, resulting in a network identical to one with vanilla convolutional layers.
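A minimal sketch of this fusion identity in the simplest setting, assuming the learned balancing weights reduce to a constant per-channel scale applied before a convolution; this illustrates the folding argument, not ShapeConv's full decomposition.

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(16, 32, 3, padding=1)
    scale = torch.rand(16)         # learned balancing weights; constants at inference

    # Fold the per-channel scale into the kernel: conv(scale * x) == fused(x).
    fused = nn.Conv2d(16, 32, 3, padding=1)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.view(1, 16, 1, 1))
        fused.bias.copy_(conv.bias)

    x = torch.randn(1, 16, 8, 8)
    assert torch.allclose(conv(x * scale.view(1, 16, 1, 1)), fused(x), atol=1e-5)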
In the second stage, we merge the rooted models by averaging their weights and fine-tuning them for each specific domain, using only data generated by the original trained models.
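A minimal sketch of the merging step, assuming the rooted models share an architecture so their parameters can be averaged element-wise; names are illustrative, and the subsequent per-domain fine-tuning on generated data is not shown.

    import torch.nn as nn

    def average_weights(model_a, model_b):
        # Element-wise average of two same-architecture models' parameters.
        sd_b = model_b.state_dict()
        return {k: (v + sd_b[k]) / 2 for k, v in model_a.state_dict().items()}

    # Toy stand-ins for the two rooted models and the merged one.
    gen_a, gen_b, merged = (nn.Linear(8, 8) for _ in range(3))
    merged.load_state_dict(average_weights(gen_a, gen_b))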
Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images.
Edge-preserving filters play an essential role in some of the most basic tasks of computational photography, such as abstraction, tonemapping, detail enhancement and texture removal, to name a few.
Manipulation of visual attributes via these StyleSpace controls is shown to be better disentangled than via those proposed in previous works.
Moreover, in the inference phase, the depthwise convolution is folded into the conventional convolution, making the computation exactly equivalent to that of a convolutional layer without over-parameterization.
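A minimal sketch of this folding identity in a simplified case: a depthwise KxK convolution followed by a 1x1 conventional convolution collapses into a single dense KxK convolution. This is illustrative of the general fusion; the paper's exact parameterization may differ.

    import torch
    import torch.nn as nn

    C, O, K = 16, 32, 3
    dw = nn.Conv2d(C, C, K, padding=1, groups=C, bias=False)   # depthwise
    pw = nn.Conv2d(C, O, 1, bias=False)                        # conventional 1x1

    # Fused kernel: f[o, c, :, :] = pw[o, c] * dw[c, 0, :, :]
    fused = nn.Conv2d(C, O, K, padding=1, bias=False)
    with torch.no_grad():
        fused.weight.copy_(pw.weight.view(O, C, 1, 1) * dw.weight.view(1, C, K, K))

    x = torch.randn(1, C, 8, 8)
    assert torch.allclose(pw(dw(x)), fused(x), atol=1e-5)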
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video. While previous methods rely on either rigging or inverse kinematics (IK) to associate a consistent skeleton with temporally coherent joint rotations, our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used motion representation.
In this paper, we present a novel data-driven framework for motion style transfer, which learns from an unpaired collection of motions with style labels, and enables transferring motion styles not observed during training.
In other words, our operators form the building blocks of a new deep motion processing framework that embeds the motion into a common latent space, shared by a collection of homeomorphic skeletons.
The emergence of deep generative models has recently enabled the automatic generation of massive amounts of graphical content, both in 2D and in 3D.
In this paper, we present a new, physically-based approach for estimating illuminant chromaticity from interreflections of light between diffuse surfaces.
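For intuition, the classical two-bounce interreflection model (stated here as background, not necessarily this paper's exact formulation): if the illuminant has RGB chromaticity $e$ and a surface has albedo $s$, a directly lit region reflects $c_1 \propto e \odot s$, while a region that also receives one interreflection bounce picks up a term $c_2 \propto e \odot s \odot s$, so the illuminant can be recovered component-wise, up to scale, as

$$ e_k \propto \frac{c_{1,k}^{2}}{c_{2,k}}, \qquad k \in \{R, G, B\}. $$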
Our translation is performed in a cascaded, deep-to-shallow, fashion, along the deep feature hierarchy: we first translate between the deepest layers that encode the higher-level semantic content of the image, proceeding to translate the shallower layers, conditioned on the deeper ones.
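A pseudocode-style sketch of this cascade under assumed names: translators is a hypothetical list of per-level translation networks, and features are ordered deepest-first as extracted from a pre-trained backbone.

    def translate_features(features, translators):
        # Translate the deepest level first: it carries the high-level semantics.
        translated = [translators[0](features[0])]
        for feat, net in zip(features[1:], translators[1:]):
            # Each shallower level is translated conditioned on the
            # already-translated, deeper level.
            translated.append(net(feat, condition=translated[-1]))
        return translated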
In order to achieve our goal, we learn to extract, directly from a video, a high-level latent motion representation, which is invariant to the skeleton geometry and the camera view.
Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation.
Accurate semantic image segmentation requires the joint consideration of local appearance, semantic information, and global scene context.
After training a deep generative network using a reference video capturing the appearance and dynamics of a target actor, we are able to generate videos where this actor reenacts other performances.
The key idea is that during the analysis, the two branches exchange information between them, thereby learning the dependencies between structure and geometry and encoding two augmented features, which are then fused into a single latent code.
The key idea is that by learning to separately extract both the common and the domain-specific features, one can synthesize more target domain data with supervision, thereby boosting the domain adaptation performance.
We demonstrate that this conceptually simple approach is highly effective for capturing large-scale structures, as well as other non-stationary attributes of the input exemplar.
Correspondence between images is a fundamental problem in computer vision, with a variety of graphics applications.
Contextual information provides important cues for disambiguating visually similar pixels in scene segmentation.
We show that the resulting P-maps may be used to evaluate how likely a rectangle proposal is to contain an instance of the class, and further process good proposals to produce an accurate object cutout mask.
Human 3D pose estimation from a single image is a challenging task with numerous applications.