We test the effectiveness of our representation on the human image harmonization task by predicting shading that is coherent with a given background image.
Our framework leverages the best of non-parametric and model-based methods and is also robust to partial occlusion.
The appearance of dressed humans undergoes complex geometric transformations induced not only by the static pose but also by its dynamics, i.e., a given pose admits a number of cloth configurations depending on how the body has moved.
Self-contacts, such as when hands touch each other or the torso or the head, are important attributes of human body language and dynamics, yet existing methods do not model or preserve these contacts.
We present an algorithm for re-rendering a person from a single image under arbitrary poses.
A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior.
We present a single-image, data-driven method to automatically relight images containing full-body humans.
We demonstrate the effectiveness of our hierarchical motion variational autoencoder in a variety of tasks including video-based human pose estimation, motion completion from partial observations, and motion synthesis from sparse key-frames.
In this paper, we introduce Attribute-conditioned Layout GAN to incorporate the attributes of design elements for graphic layout generation by forcing both the generator and the discriminator to meet attribute conditions.
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles.
To address this challenge, we propose an iterative inpainting method with a feedback mechanism.
We introduce a biomechanically constrained generative adversarial network that performs long-term inbetweening of human motions, conditioned on keyframe constraints.
According to this depth estimate, our framework then maps the input image to a point cloud and synthesizes the resulting video frames by rendering the point cloud from the corresponding camera positions.
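As a rough illustration of this unproject-and-reproject idea, a minimal numpy sketch follows; the intrinsics K, rotation R, and translation t are assumed inputs for the example, not the paper's actual interface:

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) into a camera-space point cloud (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pixels @ np.linalg.inv(K).T      # back-project pixels to viewing rays
    return rays * depth.reshape(-1, 1)      # scale each ray by its depth

def reproject(points, K, R, t):
    """Project camera-space points into a new camera with pose (R, t)."""
    cam = points @ R.T + t                  # move the cloud into the new view
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]           # perspective divide to pixel coords
```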
We show that methods trained on our dataset consistently perform well when tested on other datasets.
We experimentally demonstrate the strength of our approach over different non-hierarchical and hierarchical baselines.
An assumption widely used in recent neural style transfer methods is that image styles can be described by global statistics of deep features such as Gram or covariance matrices.
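For reference, such a global statistic is cheap to compute; a minimal PyTorch sketch of the Gram matrix, assuming a single (C, H, W) feature tensor:

```python
import torch

def gram_matrix(features):
    """Global style statistic: channel-wise Gram matrix of a (C, H, W) feature map."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.t()) / (h * w)  # normalize by the number of spatial positions
```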
We show that by such disentanglement, the contour completion model predicts reasonable contours of objects, and further substantially improves the performance of image inpainting.
Doodling is a useful and common intelligent skill that people can learn and master.
Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting only the next frame.
We present a generative image inpainting system to complete images with free-form mask and guidance.
The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image.
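A hedged sketch of what such a two-headed prediction might look like in PyTorch; the channel count, the number of planes, and the extra non-planar class are illustrative assumptions, not the paper's architecture:

```python
import torch.nn as nn

class PlaneHead(nn.Module):
    """Hypothetical head: global plane parameters plus per-pixel plane masks."""
    def __init__(self, in_channels=256, num_planes=10):
        super().__init__()
        self.num_planes = num_planes
        self.param_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, num_planes * 3))          # one 3-vector per plane
        self.mask_head = nn.Conv2d(in_channels, num_planes + 1, 1)  # + non-planar class

    def forward(self, feat):                                 # feat: (B, C, H, W)
        params = self.param_head(feat).view(-1, self.num_planes, 3)
        masks = self.mask_head(feat)                         # (B, K+1, H, W) logits
        return params, masks
```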
We propose a recurrent neural network architecture with a Forward Kinematics layer and a cycle-consistency-based adversarial training objective for unsupervised motion retargeting.
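As an illustration, a differentiable forward kinematics layer for a single kinematic chain might look as follows; this is a simplified sketch, since real skeletons are trees and rotations are usually parameterized as quaternions or axis-angle rather than raw matrices:

```python
import torch

def forward_kinematics(rotations, offsets, root=None):
    """Differentiable FK for a simple kinematic chain.
    rotations: (J, 3, 3) local joint rotations; offsets: (J, 3) bone offsets
    in the parent frame. Returns global joint positions (J, 3)."""
    if root is None:
        root = torch.zeros(3)
    positions, global_rot, pos = [], torch.eye(3), root
    for R_local, offset in zip(rotations, offsets):
        global_rot = global_rot @ R_local    # accumulate rotation down the chain
        pos = pos + global_rot @ offset      # place the child joint
        positions.append(pos)
    return torch.stack(positions)
```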
Human shape estimation is an important task for video editing, animation, and the fashion industry.
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.
The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, so they can plan early and make decisions accordingly.
Thus, they suffer from the heterogeneous object scales caused by the perspective projection of cameras onto actual scenes, and inevitably encounter parsing failures on distant objects as well as other boundary and recognition errors.
The success of various applications, including robotics, digital content creation, and visualization, demands a structured and abstract representation of the 3D world from limited sensor data.
We propose an end-to-end network architecture that replicates the forward image formation process to accomplish this task.
In this paper, we propose a novel segmentation approach that uses a rectangle as a soft constraint by transforming it into a Euclidean distance map.
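A minimal sketch of this rectangle-to-distance-map conversion, assuming scipy is available and the box is given in pixel coordinates:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def rectangle_to_distance_map(shape, box):
    """Turn a rectangle (x0, y0, x1, y1) into a Euclidean distance map:
    each pixel stores its distance to the rectangle's boundary."""
    x0, y0, x1, y1 = box
    boundary = np.ones(shape, dtype=bool)    # True = not on the boundary
    boundary[y0, x0:x1 + 1] = boundary[y1, x0:x1 + 1] = False
    boundary[y0:y1 + 1, x0] = boundary[y0:y1 + 1, x1] = False
    return distance_transform_edt(boundary)  # distance to nearest boundary pixel
```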
To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
The whitening and coloring transforms reflect a direct matching of feature covariance of the content image to a given style image, which shares similar spirits with the optimization of Gram matrix based cost in neural style transfer.
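For concreteness, a numpy sketch of the whitening and coloring transform on flattened (C, N) features; the eps regularizer is an assumption added for numerical stability:

```python
import numpy as np

def whiten_color(content, style, eps=1e-5):
    """Whitening-coloring transform on flattened features of shape (C, N):
    remove the content covariance, then impose the style covariance."""
    def sqrt_inv_and_sqrt(cov):
        vals, vecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
        return (vecs * vals ** -0.5) @ vecs.T, (vecs * vals ** 0.5) @ vecs.T

    c = content - content.mean(axis=1, keepdims=True)
    s = style - style.mean(axis=1, keepdims=True)
    white, _ = sqrt_inv_and_sqrt(c @ c.T / c.shape[1])  # Sigma_c^{-1/2}
    _, color = sqrt_inv_and_sqrt(s @ s.T / s.shape[1])  # Sigma_s^{+1/2}
    return color @ (white @ c) + style.mean(axis=1, keepdims=True)
```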
To avoid the compounding errors inherent in recursive pixel-level prediction, we propose to first estimate the high-level structure in the input frames, then predict how that structure evolves in the future, and finally construct the future frames from a single observed frame and the predicted high-level structure, without relying on any pixel-level predictions.
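Schematically, the three stages might be wired together as below; all component names (pose_estimator, pose_predictor, renderer) are hypothetical placeholders, not the paper's modules:

```python
def predict_future_frames(frames, horizon, pose_estimator, pose_predictor, renderer):
    """Hypothetical three-stage pipeline: estimate high-level structure per
    input frame, predict its future evolution, then render future frames from
    the last observed frame plus the predicted structure."""
    poses = [pose_estimator(f) for f in frames]              # stage 1: structure
    future_poses = pose_predictor(poses, horizon)            # stage 2: evolve it
    return [renderer(frames[-1], p) for p in future_poses]   # stage 3: synthesis
```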
In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions.
Instead of taking a 'blank slate' approach, we first explicitly infer the parts of the geometry visible both in the input and novel views and then re-cast the remaining synthesis problem as image completion.
Recent progress on deep discriminative and generative modeling has shown promising results on texture synthesis.
In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations.
We demonstrate the ability of the model to generate a 3D volume from a single 2D image in three sets of experiments: (1) learning from single-class objects; (2) learning from multi-class objects; and (3) testing on novel object classes.
Structured support vector machine (SSVM) based methods have demonstrated encouraging performance on recent object tracking benchmarks.
We develop a deep learning algorithm for contour detection with a fully convolutional encoder-decoder network.
This paper investigates a novel problem of generating images from visual attributes.
The transferred local shape masks constitute a patch-level segmentation solution space and we thus develop a novel cascade algorithm, PatchCut, for coarse-to-fine object segmentation.
This paper presents a scalable scene parsing algorithm based on image retrieval and superpixel matching.