As the first novelty, we propose an attention mechanism that focuses on more discriminative clips and directly optimizes a video-level (rather than clip-level) objective.
Experiments on the HTI dataset show that our method outperforms the baseline in per-frame image fidelity and spatial-temporal consistency.
Videos are created to express emotion, exchange information, and share experiences.
In this paper, we investigate using rolling shutter with a global reset feature (RSGR) to restore clean global shutter (GS) videos.
We propose a test-time adaptation method for cross-domain image segmentation.
Our method supports rendering with a single network evaluation per pixel for small-baseline light field datasets and can also be applied to larger baselines with only a few evaluations per pixel.
We present an algorithm for re-rendering a person from a single image under arbitrary poses.
Based on this insight, we develop DropLoss -- a novel adaptive loss to compensate for this imbalance without a trade-off between rare and frequent categories.
Data augmentation is a ubiquitous technique for improving image classification when labeled data is scarce.
Existing video stabilization methods often generate visible distortion or require aggressive cropping of frame boundaries, resulting in smaller fields of view.
Generating a smooth sequence of intermediate results bridges the gap between two different domains, facilitating the morphing effect across domains.
We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes.
Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior for solving various inverse image restoration tasks.
We present a learning-based approach for removing unwanted obstructions, such as window reflections, fence occlusions, or adherent raindrops, from a short sequence of images captured by a moving camera.
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation.
Recent state-of-the-art semi-supervised learning (SSL) methods use a combination of image-based transformations and consistency regularization as core components.
We propose a method for converting a single RGB-D input image into a 3D photo, a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view.
We model the HDR-to-LDR image formation pipeline as (1) dynamic range clipping, (2) non-linear mapping from a camera response function, and (3) quantization.
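For illustration, here is a minimal NumPy sketch of this three-step formation model; the gamma-style camera response function and the function name are assumptions, not the paper's learned CRF.

    import numpy as np

    def hdr_to_ldr(hdr, gamma=2.2, bit_depth=8):
        # (1) dynamic range clipping: limit radiance to the sensor range
        clipped = np.clip(hdr, 0.0, 1.0)
        # (2) non-linear mapping: a simple gamma curve stands in for the
        #     camera response function (real CRFs are camera-specific)
        responded = clipped ** (1.0 / gamma)
        # (3) quantization to discrete code values (e.g., 8-bit)
        levels = 2 ** bit_depth - 1
        return np.round(responded * levels) / levels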
Establishing dense semantic correspondences between object instances remains a challenging problem due to background clutter, significant scale and pose differences, and large intra-class variations.
Few-shot classification aims to recognize novel categories with only a few labeled images in each class.
Unsupervised domain adaptation algorithms aim to transfer the knowledge learned from one domain to another (e.g., synthetic to real images).
We validate the effectiveness of our method by transferring our pre-trained model to three different tasks, including action classification, temporal localization, and spatio-temporal action detection.
We address the problem of guided image-to-image translation where we translate an input image into another while respecting the constraints provided by an external, user-provided guidance image.
In contrast to existing algorithms that tackle the tasks of semantic matching and object co-segmentation in isolation, our method exploits the complementary nature of the two tasks.
We then show that when combined with these regularizers, the proposed method facilitates the propagation of information from generated prototypes to image data to further improve results.
In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images.
Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples.
We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences.
We even demonstrate results comparable to deep-learning-based methods in the semi-supervised setting on the DAVIS dataset.
Due to the formulation as a prediction task, most of these methods require fine-tuning during test time, such that the deep nets memorize the appearance of the objects of interest in the given video.
Our core idea is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction.
Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time.
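As a rough illustration of this test-time sampling, the following PyTorch sketch combines a fixed content code with attribute vectors drawn from a Gaussian prior; the toy encoder/generator shapes are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class DiverseTranslator(nn.Module):
        # Toy disentangled translator: a content encoder plus a generator
        # conditioned on a sampled attribute vector (illustrative shapes).
        def __init__(self, attr_dim=8):
            super().__init__()
            self.attr_dim = attr_dim
            self.content_enc = nn.Sequential(
                nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
            self.gen = nn.Sequential(
                nn.ConvTranspose2d(128 + attr_dim, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

        def forward(self, x, n_samples=3):
            c = self.content_enc(x)  # encoded content features
            outs = []
            for _ in range(n_samples):
                # sample an attribute vector from the prior and tile it spatially
                a = torch.randn(x.size(0), self.attr_dim, 1, 1, device=x.device)
                a = a.expand(-1, -1, c.size(2), c.size(3))
                outs.append(self.gen(torch.cat([c, a], 1)))
            return outs  # one diverse translation per sampled attribute

Sampling different attribute vectors for the same input yields different translations of the same content, which is what makes the outputs diverse at test time.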
Our method takes the original unprocessed and per-frame processed videos as inputs to produce a temporally consistent video.
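A minimal sketch of this interface is shown below, assuming a per-frame network that also conditions on its previous output; the name `net` and the channel layout are hypothetical.

    import torch

    def enforce_temporal_consistency(net, raw_frames, processed_frames):
        # raw_frames / processed_frames: lists of CxHxW tensors.
        # The first output passes the first processed frame through; each
        # later output conditions on the original unprocessed frame, the
        # per-frame processed frame, and the previous consistent output.
        outputs = [processed_frames[0]]
        for t in range(1, len(raw_frames)):
            inp = torch.cat([raw_frames[t], processed_frames[t], outputs[-1]], 0)
            outputs.append(net(inp.unsqueeze(0)).squeeze(0))
        return outputs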
In classification adaptation, we transfer a pre-trained network to a multi-label classification task for recognizing the presence of a certain object in an image.
In contrast to existing methods that consider only the guidance image, the proposed algorithm can selectively transfer salient structures that are consistent with both guidance and target images.
Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up.
However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results.
Our core idea is that adversarial examples targeting a neural-network-based policy are not effective against the frame prediction model.
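One way to operationalize this observation is an action-consistency check: if the action chosen on the observed frame disagrees with the action chosen on a frame predicted from recent history, the observation is flagged. The sketch below assumes hypothetical `policy` and `frame_model` interfaces.

    import numpy as np

    def flag_adversarial(policy, frame_model, history_frames, history_actions, obs):
        # Predict the current frame from recent history; perturbations
        # crafted against the policy should not fool this model.
        predicted = frame_model.predict(history_frames, history_actions)
        a_obs = int(np.argmax(policy.action_probs(obs)))
        a_pred = int(np.argmax(policy.action_probs(predicted)))
        # Disagreement suggests the observation may be adversarial;
        # the predicted frame can then be used to act defensively.
        return a_obs != a_pred, predicted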
We present an unsupervised representation learning approach using videos without semantic labels.
Specifically, we learn adaptive correlation filters on the outputs from each convolutional layer to encode the target appearance.
Second, we learn a correlation filter over a feature pyramid centered at the estimated target position for predicting scale changes.
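For concreteness, here is a MOSSE-style single-channel correlation filter learned in closed form in the Fourier domain; this generic ridge-regression formulation is an illustration, not the exact training procedure from the paper.

    import numpy as np

    def learn_filter(feat, desired_response, lam=1e-2):
        # Closed-form ridge regression in the Fourier domain:
        # H* = (G . conj(F)) / (F . conj(F) + lambda)
        F = np.fft.fft2(feat)              # feature map (e.g., one conv channel)
        G = np.fft.fft2(desired_response)  # Gaussian peak at the target center
        return (G * np.conj(F)) / (F * np.conj(F) + lam)

    def locate_target(H, feat):
        # Correlate new features with the learned filter; the response
        # peak gives the predicted target position.
        response = np.real(np.fft.ifft2(H * np.fft.fft2(feat)))
        return np.unravel_index(np.argmax(response), response.shape)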
We propose a new deep convolutional neural network (CNN) architecture for removing rain streaks from individual images.
First, we exploit the discriminative constraints to capture the intra- and inter-class relationships of image embeddings.
Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution.
We introduce a deep network architecture called DerainNet for removing rain streaks from an image.
In this paper, we address this problem by progressive domain adaptation with two main steps: classification adaptation and detection adaptation.
Using these datasets, we conduct a large-scale user study to quantify the performance of several representative state-of-the-art blind deblurring algorithms.
The outputs of the last convolutional layers encode the semantic information of targets and such representations are robust to significant appearance variations.
However, the internal dictionary obtained from the given image may not always be sufficiently expressive to cover the textural appearance variations in the scene.