Moreover, even for single-image monocular deraining, many current methods fail to complete the task satisfactorily because they mostly rely on per-pixel loss functions and ignore semantic information.
It is thus unclear how these algorithms perform on public face hallucination datasets.
Active visual tracking becomes notoriously difficult when distracting objects appear, as distractors often mislead the tracker by occluding the target or introducing a confusing appearance.
In order to reduce the number of network parameters, the intra-stage recursive computation of ResNet is adopted in our Stack T-Net.
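As an illustration of this parameter-sharing idea, the minimal PyTorch sketch below applies one residual block's weights repeatedly within a stage instead of stacking distinct blocks; the channel width and recursion depth are illustrative assumptions, not Stack T-Net's actual configuration.

```python
# Sketch of intra-stage recursive computation: the same residual block is applied
# T times within a stage, so the stage's depth grows without adding parameters.
import torch
import torch.nn as nn

class RecursiveResBlock(nn.Module):
    def __init__(self, channels=32, num_recursions=3):
        super().__init__()
        self.num_recursions = num_recursions
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        out = x
        for _ in range(self.num_recursions):   # the same weights are reused at every step
            out = out + self.body(out)
        return out

x = torch.rand(1, 32, 64, 64)
y = RecursiveResBlock()(x)                      # same parameter count as a single block
```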
Single-image super-resolution (SR) and multi-frame SR are two ways to super-resolve low-resolution images.
We first use a coarse deraining network to reduce the rain streaks on the input images, and then adopt a pre-trained semantic segmentation network to extract semantic features from the coarsely derained image.
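A minimal sketch of this two-step pipeline is given below, assuming a small hypothetical coarse-deraining CNN and using a pre-trained torchvision segmentation backbone as the semantic feature extractor; neither stands in for the paper's actual architectures.

```python
# Coarse derain -> semantic feature extraction, as a hedged sketch.
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

class CoarseDerainNet(nn.Module):
    """Hypothetical small residual CNN that predicts a rain-streak layer to subtract."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, rainy):
        return rainy - self.body(rainy)          # residual prediction of the rain layer

# Frozen, pre-trained segmentation network used only as a semantic feature extractor.
seg_net = deeplabv3_resnet50(weights="DEFAULT").eval()
for p in seg_net.parameters():
    p.requires_grad_(False)

def semantic_features(img):
    # High-level backbone features serve as the "semantic features" in this sketch.
    return seg_net.backbone(img)["out"]

rainy = torch.rand(1, 3, 256, 256)
coarse = CoarseDerainNet()(rainy)
feats = semantic_features(coarse)                # passed on to the fine deraining stage
```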
Video deraining is an important task in computer vision as the unwanted rain hampers the visibility of videos and deteriorates the robustness of most outdoor vision systems.
Images captured on snowy days suffer from noticeable degradation of scene visibility, which degrades the performance of current vision-based intelligent systems.
In addition, to further refine the result, a Differential-driven Dual Attention-in-Attention Model (D-DAiAM) is proposed with a "heavy-to-light" scheme that removes rain by addressing the unsatisfactorily derained regions.
Increasingly, modern mobile devices allow capturing images at Ultra-High-Definition (UHD) resolution, which includes 4K and 8K images.
Also, we build a new dataset, namely the iPER dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis.
To address this problem, we propose a new method which combines two GAN models, i.e., a learning-to-Blur GAN (BGAN) and a learning-to-DeBlur GAN (DBGAN), in order to learn a better model for image deblurring by primarily learning how to blur images.
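The sketch below illustrates how such a blur-then-deblur coupling could be wired up in PyTorch; the tiny networks and loss terms are placeholders rather than the actual BGAN/DBGAN designs.

```python
# Hedged sketch of coupling a learning-to-blur generator with a deblurring generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_net(out_ch):
    return nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(32, out_ch, 3, padding=1))

blur_gen, deblur_gen = tiny_net(3), tiny_net(3)   # stand-ins for the BGAN / DBGAN generators
d_blur = tiny_net(1)                              # discriminator on blurry images

sharp = torch.rand(2, 3, 64, 64)

# BGAN step: synthesize a realistic blur and try to fool the blur discriminator.
fake_blur = sharp + blur_gen(sharp)
adv = F.binary_cross_entropy_with_logits(d_blur(fake_blur),
                                         torch.ones_like(d_blur(fake_blur)))

# DBGAN step: restore the synthesized blurry image back to the sharp original.
restored = fake_blur.detach() + deblur_gen(fake_blur.detach())
recon = F.l1_loss(restored, sharp)
```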
In this paper, we study the problem of weakly-supervised temporal grounding of sentences in videos.
In this paper, we aim to transform an image of a fine-grained category to synthesize new images that preserve the identity of the input image, which can thereby benefit subsequent fine-grained image recognition and few-shot learning tasks.
Ablation studies have validated the effectiveness of both the proposed gated multi-layer feature extraction sub-network and the deformable occlusion handling sub-network.
In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape, which not only models the joint locations and rotations but also characterizes the personalized body shape.
In this paper, we address a novel task, namely weakly-supervised spatio-temporal grounding of natural sentences in videos.
In AD-VAT, both the tracker and the target are approximated by end-to-end neural networks and are trained via RL in a dueling/competitive manner: i.e., the tracker aims to lock onto the target, while the target tries to escape from the tracker.
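The snippet below sketches one plausible form of this dueling objective, in which the target's reward roughly opposes the tracker's; the function names and reward shaping are assumptions and do not reproduce AD-VAT's exact formulation.

```python
# Dueling reward sketch: the tracker is rewarded for keeping the target near the desired
# relative distance and bearing, and the target receives (roughly) the opposite signal.
import math

def tracker_reward(rel_dist, rel_angle, d_max=5.0, a_max=math.pi):
    """Higher when the target stays close to the expected position in front of the camera."""
    return 1.0 - rel_dist / d_max - abs(rel_angle) / a_max

def target_reward(rel_dist, rel_angle):
    # Competitive objective: the target is encouraged to do what hurts the tracker.
    return -tracker_reward(rel_dist, rel_angle)
```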
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
To address the training difficulty, we propose a training algorithm using a tighter approximation to the derivative of the sign function, a magnitude-aware gradient for weight updating, a better initialization method, and a two-step scheme for training a deep network.
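As one concrete way to realize the tighter sign-derivative approximation, the sketch below binarizes activations with sign() in the forward pass while backpropagating through a clipped piecewise-polynomial surrogate; the particular surrogate shown is an illustrative assumption, and the other ingredients (magnitude-aware weight gradients, initialization, two-step training) are not shown.

```python
# Custom autograd function: hard sign() forward, polynomial surrogate derivative backward,
# i.e., a tighter replacement for the usual hard straight-through estimator.
import torch

class ApproxSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of a clipped piecewise-quadratic surrogate of sign(x):
        # 2 - 2|x| on [-1, 1], and zero outside that interval.
        grad_in = torch.clamp(2.0 - 2.0 * x.abs(), min=0.0)
        return grad_out * grad_in

x = torch.randn(4, requires_grad=True)
y = ApproxSign.apply(x)
y.sum().backward()          # gradients flow through the polynomial surrogate
```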
We further propose an environment augmentation technique and a customized reward function, which are crucial for successful training.
In this work, we study the 1-bit convolutional neural networks (CNNs), of which both the weights and activations are binary.
To tackle the second challenge, we leverage the developed DBLRNet as a generator in the GAN (generative adversarial network) architecture, and employ a content loss in addition to an adversarial loss for efficient adversarial training.
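A minimal sketch of such a combined objective is shown below, assuming an L2 content term and a standard non-saturating adversarial term; the weighting factor is illustrative rather than the paper's value.

```python
# Content loss + adversarial loss for a restoration network acting as the GAN generator.
import torch
import torch.nn.functional as F

def generator_loss(restored, sharp_gt, disc_logits_on_restored, lambda_adv=1e-3):
    content = F.mse_loss(restored, sharp_gt)                  # fidelity to the ground truth
    adversarial = F.binary_cross_entropy_with_logits(         # encourage fooling the discriminator
        disc_logits_on_restored, torch.ones_like(disc_logits_on_restored))
    return content + lambda_adv * adversarial
```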
More specifically, a hybrid loss is proposed to capitalize on the content information of input frames, the style information of a given style image, and the temporal information of consecutive frames.
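The sketch below illustrates one way such a hybrid loss could be assembled, assuming perceptual feature maps for the content and style terms and a simple frame-difference temporal term (no optical-flow warping); the weights and the feature extractor are placeholders, not the paper's exact choices.

```python
# Hybrid loss sketch: content term + Gram-matrix style term + temporal-consistency term.
import torch
import torch.nn.functional as F

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def hybrid_loss(stylized_t, stylized_tm1, content_feat, style_feat, out_feat,
                w_content=1.0, w_style=10.0, w_temporal=1.0):
    content = F.mse_loss(out_feat, content_feat)              # preserve input-frame content
    style = F.mse_loss(gram(out_feat), gram(style_feat))      # match style-image statistics
    temporal = F.mse_loss(stylized_t, stylized_tm1)           # discourage flicker across frames
    return w_content * content + w_style * style + w_temporal * temporal
```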
We study active object tracking, where a tracker takes as input the visual observation (i.e., a frame sequence) and produces the camera control signal (e.g., move forward, turn left, etc.).
We inspect the recent advances in various aspects and propose some interesting directions for future research.
In this paper, we propose a label propagation framework to handle the multiple object tracking (MOT) problem for a generic object type (cf.
In this paper, we present a unified method for joint face image analysis, i.e., simultaneously estimating head pose, facial expression and landmark positions in real-world face images.