Motivated by transformers, which exploit visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in self-supervised learning (SSL).
To this end, we propose spatially probabilistic diversity normalization (SPDNorm) inside the modulation to model the probability of generating a pixel conditioned on its context.
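A minimal sketch of what such a normalization could look like, assuming a SPADE-style scale/shift predicted from context features and weighted per pixel by a generation-probability map; the class name, probability map, and layer choices are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SPDNormSketch(nn.Module):
    """Illustrative probability-modulated normalization: instance-normalize
    features, then apply a context-predicted scale/shift whose strength is
    weighted per pixel by a generation-probability map."""
    def __init__(self, channels, ctx_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Conv2d(ctx_channels, channels, 3, padding=1)
        self.to_beta = nn.Conv2d(ctx_channels, channels, 3, padding=1)

    def forward(self, feat, ctx, prob):
        # prob: (B, 1, H, W) in [0, 1]; how strongly each pixel's generation
        # is conditioned on surrounding context (placeholder definition).
        h = self.norm(feat)
        gamma = self.to_gamma(ctx) * prob
        beta = self.to_beta(ctx) * prob
        return h * (1 + gamma) + beta
```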
The forward inference projects input images into deep features, while the backward inference remaps deep features back to input images in a lossless and unbiased way.
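One standard way to realize such a lossless, unbiased forward/backward pair is an additive coupling layer from the normalizing-flow literature; the sketch below is illustrative and not necessarily this paper's architecture:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Invertible block: forward maps x -> y losslessly; inverse recovers x exactly."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, dim // 2))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        y2 = x2 + self.net(x1)          # only one half is shifted
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        x2 = y2 - self.net(y1)          # exact inversion, no information loss
        return torch.cat([y1, x2], dim=-1)
```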
Recently, adversarial attacks have been applied to visual object tracking to evaluate the robustness of deep trackers.
While existing methods combine an input image with these low-level controls as CNN inputs, the resulting feature representations are insufficient to convey user intentions, leading to unfaithfully generated content.
By strengthening the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo extends MoCo to the temporal domain within a contrastive learning framework.
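A rough sketch of how temporal decay of the keys could enter a MoCo-style InfoNCE loss; the decay schedule and the `ages` bookkeeping are assumed for illustration:

```python
import torch
import torch.nn.functional as F

def temporally_decayed_infonce(q, pos_k, queue, ages, tau=0.07, decay=0.99):
    """MoCo-style InfoNCE where older negative keys lose influence.
    q: (B,D) queries; pos_k: (B,D) positive keys; queue: (K,D) negatives;
    ages: (K,) iterations each key has spent in the queue (assumed bookkeeping)."""
    q, pos_k, queue = (F.normalize(t, dim=1) for t in (q, pos_k, queue))
    l_pos = (q * pos_k).sum(dim=1, keepdim=True)        # (B, 1) positive logits
    l_neg = q @ queue.t()                               # (B, K) negative logits
    w = decay ** ages.float()                           # older key -> smaller weight
    logits = torch.cat([l_pos, l_neg * w], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```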
A recent pioneering work employed knowledge distillation to reduce the dependency on human parsing: try-on images produced by a parser-based method are used as supervision to train a "student" network that does not rely on segmentation, making the student mimic the try-on ability of the parser-based model.
We further analyze the KL-divergence of the proposed loss function and find that the loss stabilization term drives the perturbations toward a fixed objective spot while deviating from the ground truth.
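As a rough illustration of such an update (the exact loss is not given in this snippet), a perturbation step could attract the model's output toward a fixed objective spot while repelling it from the ground truth; the loss form and hyperparameters below are assumptions:

```python
import torch
import torch.nn.functional as F

def stabilized_perturbation_step(model, x, delta, fixed_target, gt,
                                 alpha=1.0, step=1e-2, eps=8 / 255):
    """Hypothetical update: pull predictions to a fixed objective spot,
    push them away from the ground truth, then project onto an L_inf ball."""
    delta = delta.detach().requires_grad_(True)
    out = model(x + delta)
    loss = F.mse_loss(out, fixed_target) - alpha * F.mse_loss(out, gt)
    loss.backward()
    with torch.no_grad():
        delta = delta - step * delta.grad.sign()  # descend: toward target, away from gt
        delta = delta.clamp(-eps, eps)            # keep the perturbation bounded
    return delta
```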
Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmospheric light.
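Under the commonly used heavy-rain formation model (an assumption here, since the snippet does not spell out the equation), these components combine as:

```latex
% O: observed rainy image, B: background, S: accumulated rain streaks,
% T: transmission map, A: global atmospheric light
O = T \odot (B + S) + (1 - T) \odot A
```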
Advances in visual tracking have continuously been driven by deep learning models.
We use CNN features from the deep and shallow layers of the encoder to represent structures and textures of an input image, respectively.
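A minimal sketch of this idea with a VGG-16 encoder; the specific layer indices (relu1_2 for textures, relu4_3 for structures) are illustrative choices, not the paper's:

```python
import torch
import torchvision.models as models

# Illustrative encoder: VGG-16 without pretrained weights to keep the sketch
# self-contained; in practice a pretrained or jointly trained encoder is used.
vgg = models.vgg16(weights=None).features.eval()

def structure_texture_feats(img):
    shallow, deep, h = None, None, img
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i == 3:     # relu1_2: shallow layer, fine textures
            shallow = h
        elif i == 22:  # relu4_3: deep layer, global structures
            deep = h
            break
    return deep, shallow

with torch.no_grad():
    deep, shallow = structure_texture_feats(torch.randn(1, 3, 256, 256))
```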
The displacement map and the coarse model are used to render a final detailed face, which can again be compared with the original input image, yielding a photometric loss for the second stage.
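Schematically, the second-stage objective might look as follows, where `render_fn` stands in for a differentiable renderer (an assumption, not the paper's exact pipeline):

```python
import torch.nn.functional as F

def photometric_loss(render_fn, coarse_geometry, displacement, input_img, mask):
    """Second-stage sketch: perturb the coarse geometry with the predicted
    displacement map, re-render, and compare against the original photo.
    `render_fn` is a stand-in for a differentiable renderer (assumption)."""
    detailed = render_fn(coarse_geometry + displacement)
    return F.l1_loss(detailed * mask, input_img * mask)
```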
In the distillation process, we propose a fidelity loss to enable the student network to maintain the representation capability of the teacher network.
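The exact fidelity loss is not specified in this snippet; a plausible sketch is an L2 match between normalized student and teacher representations:

```python
import torch.nn.functional as F

def fidelity_loss(student_feat, teacher_feat):
    """Assumed form: align the student's representation with the (frozen)
    teacher's so the student retains the teacher's representation capability."""
    s = F.normalize(student_feat.flatten(1), dim=1)
    t = F.normalize(teacher_feat.flatten(1), dim=1).detach()  # no teacher grads
    return F.mse_loss(s, t)
```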
The main ingredient of the view alignment loss is a differentiable dense optical flow estimator that backpropagates alignment errors between an input view and a synthetic rendering from another input view, projected to the target view through the 3D shape being inferred.
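A compact sketch of such a loss, with `flow_net` as a placeholder for a differentiable dense optical-flow estimator (e.g., a RAFT-style network); the penalty form is an assumption:

```python
def view_alignment_loss(flow_net, input_view, rendered_view):
    """Sketch: estimate dense flow between an input view and a rendering
    projected from another view through the inferred 3D shape; any residual
    flow is penalized as alignment error."""
    flow = flow_net(input_view, rendered_view)   # (B, 2, H, W) displacement field
    return flow.norm(dim=1).mean()               # zero flow = views aligned
```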
We first propose a facial component guided deep convolutional neural network (CNN) to restore a coarse face image, denoted as the base image, in which the facial components are automatically generated from the input face image.
Visual attention, a notion derived from cognitive neuroscience, facilitates human perception of the most pertinent subset of sensory data.
We also propose a gated fusion scheme to control how the variations captured by the deformable convolution affect the original appearance.
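A minimal sketch of such a gate, assuming a learned per-pixel sigmoid that blends the deformable-convolution branch with the original appearance features (the blending rule is an illustrative choice):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch: a per-pixel gate decides how much the deformable-convolution
    branch alters the original appearance features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, appearance_feat, deform_feat):
        g = self.gate(torch.cat([appearance_feat, deform_feat], dim=1))
        return g * deform_feat + (1 - g) * appearance_feat
```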
The proposed network is composed of three deep convolutional neural networks (CNNs) and a recurrent neural network (RNN).
To augment positive samples, we use a generative network to randomly generate masks, which are applied to adaptively drop out input features, capturing a variety of appearance changes.
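A sketch of this augmentation, where `mask_generator` stands in for the paper's generative network and the sampling step is simplified for illustration:

```python
import torch

def masked_dropout_augment(feat, mask_generator):
    """Positive-sample augmentation sketch: a generator predicts a soft mask
    that adaptively drops out input features. The hard Bernoulli sampling here
    is a simplification (it is not differentiable w.r.t. the generator)."""
    mask = torch.sigmoid(mask_generator(feat))   # (B, 1, H, W) keep-probabilities
    keep = torch.bernoulli(mask)                 # random binary dropout pattern
    return feat * keep                           # augmented positive sample
```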
Our method integrates feature extraction, response map generation, and model update into neural networks for end-to-end training.
Exemplar-based face sketch synthesis methods often face the challenge that input photos are captured under lighting conditions different from those of the training photos.