We study conditional image repainting, where a model is trained to generate visual content conditioned on user inputs and to composite the generated content seamlessly onto a user-provided image while preserving the semantics of the user inputs.
Frequency aliasing in the digital capture of display screens leads to the moiré pattern, appearing as stripe-shaped distortions in images.
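As a toy 1-D illustration of the aliasing effect behind such patterns (a sketch under assumed sampling rates, not drawn from the paper above): sampling a sinusoid above the Nyquist frequency produces samples indistinguishable from those of a much lower frequency, which is the mechanism that turns fine screen pixels into coarse stripes.

```python
import numpy as np

# Sample a 9 Hz sinusoid at only 10 Hz (Nyquist limit: 5 Hz).
fs = 10.0                      # sampling rate of the "camera"
f_signal = 9.0                 # frequency of the screen pattern
t = np.arange(0, 2, 1 / fs)    # sample instants

sampled = np.cos(2 * np.pi * f_signal * t)
alias = np.cos(2 * np.pi * (fs - f_signal) * t)  # aliased frequency: 1 Hz

# The under-sampled 9 Hz signal is numerically identical to a 1 Hz one.
print(np.allclose(sampled, alias))  # True
```

In 2-D images the same folding of high spatial frequencies onto low ones appears as the stripe-shaped moiré distortions described above.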
Based on this insight, we propose a guided calibration network, named GCNet, that explicitly leverages object shape and shading information for improved lighting estimation.
With the rapid development of high-resolution 3D vision applications, the traditional way of manipulating surface detail requires considerable memory and computing time.
Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions.
Scene Dynamic Recovery (SDR) by inverting distorted Rolling Shutter (RS) images to an undistorted high frame-rate Global Shutter (GS) video is a severely ill-posed problem due to the missing temporal dynamic information in both RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions is unavailable.
Although the ambiguity is alleviated for non-Lambertian objects, the problem remains difficult for more general objects with complex shapes that introduce irregular shadows, and for general materials with complex reflectance, such as anisotropic reflectance.
Experiments on EvRealHands demonstrate that EvHandPose outperforms previous event-based methods in all evaluation scenes and achieves accurate, stable hand pose estimation with high temporal resolution in fast-motion and strong-light scenes compared with RGB-based methods. It also generalizes well to outdoor scenes and to another type of event camera, and shows potential for the hand gesture recognition task.
Given an RGB image focused at an arbitrary distance, we explore the high temporal resolution of event streams, from which we automatically select refocusing timestamps and reconstruct corresponding refocused images with events to form a focal stack.
Capturing high frame rate and high dynamic range (HFR&HDR) color videos in high-speed scenes with conventional frame-based cameras is very challenging.
Relighting an outdoor scene is challenging due to the diverse illuminations and salient cast shadows.
Language-based colorization produces plausible colors consistent with the language description provided by the user.
Limited by the trade-off between frame rate and exposure time when capturing moving scenes with conventional cameras, frame based HDR video reconstruction suffers from scene-dependent exposure ratio balancing and ghosting artifacts.
We summarize the performance of deep learning photometric stereo models on the most widely-used benchmark data set.
In this paper, we propose a deep neural network named NeuralMPS to solve the MPS problem under general non-Lambertian spectral reflectances.
Automatic image colorization is an ill-posed problem with multi-modal uncertainty, and two main challenges remain with previous methods: incorrect semantic colors and under-saturation.
Uncalibrated photometric stereo (UPS) is challenging due to the inherent ambiguity brought by unknown light.
This paper presents a near-light photometric stereo method that faithfully preserves sharp depth edges in the 3D reconstruction.
The enhancement is done by jointly optimizing the Retinex decomposition and the illumination adjustment.
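The Retinex model factors an observed image into reflectance and illumination, I = R * L. The following is a minimal sketch of that decomposition with a simple illumination adjustment (illumination estimated by crude local averaging and brightened with a gamma curve, as an assumed stand-in, not the paper's joint optimization):

```python
import numpy as np

def box_blur(img, k=3):
    """Crude local averaging as a stand-in for an illumination smoothness prior."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def retinex_enhance(I, gamma=0.5, eps=1e-6):
    L = np.clip(box_blur(I), eps, 1.0)   # smooth illumination estimate
    R = I / L                            # Retinex: reflectance = I / L
    L_adj = L ** gamma                   # gamma curve brightens dark illumination
    return np.clip(R * L_adj, 0.0, 1.0)  # recompose enhanced image

dark = np.full((8, 8), 0.04)             # uniformly under-exposed patch
print(retinex_enhance(dark).mean() > dark.mean())  # True
```

The joint optimization in the paper refines both factors together rather than fixing the illumination estimate first as this sketch does.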
no code implementations • 23 Jan 2022 • Tiejun Huang, Yajing Zheng, Zhaofei Yu, Rui Chen, Yuan Li, Ruiqin Xiong, Lei Ma, Junwei Zhao, Siwei Dong, Lin Zhu, Jianing Li, Shanshan Jia, Yihua Fu, Boxin Shi, Si Wu, Yonghong Tian
By treating vidar as spike trains in biological vision, we have further developed a spiking neural network-based machine vision system that combines the speed of the machine and the mechanism of biological vision, achieving high-speed object detection and tracking 1,000x faster than human vision.
Conditional image repainting (CIR) is an advanced image editing task, which requires the model to generate visual content in user-specified regions conditioned on multiple cross-modality constraints, and composite the visual content with the provided background seamlessly.
Evaluating photometric stereo using real-world datasets is important yet difficult.
Haze, a common kind of bad weather caused by atmospheric scattering, decreases the visibility of scenes and degenerates the performance of computer vision algorithms.
Further, for training SCFlow, we synthesize two sets of optical flow data for the spiking camera, SPIkingly Flying Things and Photo-realistic High-speed Motion, denoted as SPIFT and PHM respectively, corresponding to random high-speed and well-designed scenes.
Based on this, we propose a new method that amends the label distribution of each facial image by leveraging correlations among expressions in the semantic space.
Experimental results on analytically computed, synthetic, and real-world surfaces show that our method yields accurate and stable reconstruction for both orthographic and perspective normal maps.
To make the problem well-posed, existing MPS methods rely on restrictive assumptions, such as a shape prior or surfaces with a monochromatic, uniform albedo.
EventZoom is trained in a noise-to-noise fashion where the two ends of the network are unfiltered noisy events, enforcing noise-free event restoration.
Mimicking the sampling mechanism of the fovea, a retina-inspired camera, named the spiking camera, has been developed to record external information at a sampling rate of 40,000 Hz and to output asynchronous binary spike streams.
This paper studies the problem of panoramic image reflection removal, aiming at relieving the content ambiguity between reflection and transmission scenes.
We propose DeRenderNet, a deep neural network to decompose the albedo and latent lighting, and render shape-(in)dependent shadings, given a single image of an outdoor urban scene, trained in a self-supervised manner.
We train a deep neural network to regress intrinsic cues with physically-based constraints and use them to estimate global and local lighting.
In this paper, to reconstruct high-resolution intensity images from event data, we propose EvIntSR-Net, which converts event data into multiple latent intensity frames to achieve super-resolution on intensity images.
In this work, we extend the contextual encoding layer that was originally designed for 2D tasks to 3D point cloud scenarios.
For all-pixel operation, we propose the Normal Regression Network to make efficient use of the intra-image spatial information for predicting a surface normal map with rich details.
A conventional camera often suffers from over- or under-exposure when recording a real-world scene with a very high dynamic range (HDR).
We propose RIFE, a Real-time Intermediate Flow Estimation algorithm for Video Frame Interpolation (VFI).
To deal with the uncalibrated scenario where light directions are unknown, we introduce a new convolutional network, named LCNet, to estimate light directions from input images.
A graph convolutional neural network is introduced to predict the performance of architectures based on the learned representations and their relation modeled by the graph.
To promote the capability of the student generator, we include a student discriminator that measures the distances between real images and images generated by the student and teacher generators.
On the one hand, massive trainable parameters significantly enhance the performance of these deep networks.
In contrast, it is more reasonable to treat the generated data as unlabeled; it could be positive or negative depending on its quality.
To facilitate end-to-end training, we further develop a scenario context information extraction branch to extract context information from raw RGB video directly.
Ranked #74 on Skeleton Based Action Recognition on NTU RGB+D
From a single viewpoint, we use a set of photometric stereo images to identify surface points with the same distance to the camera.
The widely used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between floating-point values.
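To make the contrast concrete, here is a small sketch (an assumed setup, not the paper's exact formulation) comparing the multiplication-based cross-correlation similarity with an addition-only alternative, the negative L1 distance between patch and filter:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))   # input feature patch
F = rng.standard_normal((3, 3))   # convolution filter

# Standard convolution output at one location: cross-correlation <X, F>,
# built from element-wise multiplications.
cross_corr = float(np.sum(X * F))

# Addition-only similarity: negative L1 distance, using only
# subtractions, absolute values, and sums.
l1_similarity = float(-np.sum(np.abs(X - F)))

# The L1 similarity attains its maximum, exactly 0, when X equals F.
print(-np.sum(np.abs(X - X)) == 0.0)  # True
```

Replacing the multiplication-heavy measure with an additive one is the kind of substitution this line of work explores to cut computation cost.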
When we take photos through glass windows or doors, the transmitted background scene is often blended with undesirable reflection.
We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning-based implicit signed distance functions.
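For context, classic (non-differentiable) sphere tracing marches a ray through a signed distance field, at each step advancing by the SDF value, which safely bounds the distance to the nearest surface. A minimal sketch with an assumed unit-sphere SDF (illustrating the base technique, not the paper's differentiable formulation):

```python
import numpy as np

def sdf_sphere(p, r=1.0):
    """Signed distance from point p to a unit sphere at the origin."""
    return np.linalg.norm(p) - r

def sphere_trace(origin, direction, max_steps=64, tol=1e-5):
    t = 0.0
    for _ in range(max_steps):
        d = sdf_sphere(origin + t * direction)
        if d < tol:
            return t      # hit: ray parameter at the surface
        t += d            # safe step: SDF bounds distance to nearest surface
    return None           # miss within the step budget

# Ray from z = -3 along +z hits the unit sphere at z = -1, i.e. t = 2.
hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(abs(hit - 2.0) < 1e-3)  # True
```

Making this marching loop differentiable is what lets gradients flow from rendered pixels back into a learned implicit SDF.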
Architectures in the population that share parameters within one SuperNet in the latest generation will be tuned over the training dataset for a few epochs.
no code implementations • 24 Jul 2019 • Shaodi You, Erqi Huang, Shuaizhe Liang, Yongrong Zheng, Yunxiang Li, Fan Wang, Sen Lin, Qiu Shen, Xun Cao, Diming Zhang, Yuanjiang Li, Yu Li, Ying Fu, Boxin Shi, Feng Lu, Yinqiang Zheng, Robby T. Tan
This document introduces the background and the usage of the Hyperspectral City Dataset and the benchmark.
Specifically, we exploit the unlabeled data to mimic the classification characteristics of giant networks, so that the original capacity can be preserved nicely.
This paper solves the Sparse Photometric stereo through Lighting Interpolation and Normal Estimation using a generative Network (SPLINE-Net).
Learning portable neural networks is essential for computer vision, so that pre-trained heavy deep models can be deployed on edge devices such as mobile phones and micro sensors.
This paper makes a first attempt to bring the Shape from Polarization (SfP) problem to the realm of deep learning.
Temporal Video Frame Synthesis (TVFS) aims at synthesizing novel frames at timestamps different from existing frames, which has wide applications in video coding, editing, and analysis.
In this way, a portable student network with significantly fewer parameters can achieve considerable accuracy, comparable to that of the teacher network.
Compared with traditional video retargeting, stereo video retargeting poses new challenges because stereo video contains the depth information of salient objects and its temporal dynamics.
Removing the undesired reflections from images taken through the glass is of broad application to various computer vision tasks.
Removing undesired reflections from a photo taken in front of a glass is of great importance for enhancing the efficiency of visual computing systems.
Recent developments in the field have enabled shape recovery techniques for surfaces of various types, but an effective solution to directly estimating the surface normal in the presence of highly specular reflectance remains elusive.
Radiometrically calibrating the images from Internet photo collections brings photometric analysis from lab data to big image data in the wild, but conventional calibration methods cannot be directly applied to such image data.
As prior knowledge of objects or object features helps us relate similar objects in attentional tasks, pre-trained deep convolutional neural networks (CNNs) can be used to detect salient objects in images regardless of whether the object class is within the network's knowledge.
Recent progress on photometric stereo extends the technique to deal with general materials and unknown illumination conditions.
We propose a framework to overcome these key challenges, allowing the benefits of polarization to be used to enhance depth maps.