To ground this study in practice, we first propose a generic, cost-effective Transformer-based framework for image processing.
Long-range temporal alignment is critical yet challenging for video restoration tasks.
In this paper, we propose a novel design of Sparse Steerable Convolution (SS-Conv) to address this shortcoming; SS-Conv greatly accelerates steerable convolution with sparse tensors, while strictly preserving the property of SE(3)-equivariance.
Single image super-resolution (SISR) deals with a fundamental problem of upsampling a low-resolution (LR) image to its high-resolution (HR) version.
We consider the single image super-resolution (SISR) problem, where a high-resolution (HR) image is generated based on a low-resolution (LR) input.
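The classical baseline that learned SISR methods are measured against is plain interpolation of the LR input. Below is a minimal, illustrative bilinear-upsampling sketch (not any of the cited papers' methods; the function name and half-pixel sampling convention are my own assumptions):

```python
import numpy as np

def upscale_bilinear(lr, scale):
    """Bilinearly upsample a 2-D grayscale image by an integer factor.

    Illustrative SISR baseline only: learned SR networks are typically
    evaluated by how much they improve over interpolation like this.
    """
    h, w = lr.shape
    H, W = h * scale, w * scale
    # Half-pixel-centered sample coordinates in LR space for each HR pixel.
    ys = (np.arange(H) + 0.5) / scale - 0.5
    xs = (np.arange(W) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0, 1)[None, :]   # horizontal blend weights
    top = lr[y0][:, x0] * (1 - wx) + lr[y0][:, x1] * wx
    bot = lr[y1][:, x0] * (1 - wx) + lr[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

A constant image is reproduced exactly, and any HR output stays within the LR value range, which makes interpolation a safe but blurry baseline.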
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
It simultaneously estimates the noise level and learns the image prior directly from a single noisy image.
Skeleton-based action recognition has attracted increasing research attention in recent years.
In this work, we propose an interactive system to design diverse high-quality garment images from fashion sketches and texture information.
Motivated by these findings, we propose a temporal multi-correspondence aggregation strategy to leverage similar patches across frames, and a cross-scale nonlocal-correspondence aggregation scheme to explore self-similarity of images across scales.
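The core idea of aggregating similar patches across frames can be illustrated with a toy nearest-patch search: for a query patch in the reference frame, collect the k most similar patches from neighboring frames and average them. This is only a brute-force sketch of cross-frame patch aggregation (function name and parameters are hypothetical); the paper's actual multi-correspondence strategy is considerably more elaborate:

```python
import numpy as np

def aggregate_similar_patches(ref, neighbors, patch=3, k=4):
    """Average the k patches across `neighbors` (a list of 2-D frames)
    most similar (in SSD) to the central patch of `ref`.

    Toy illustration of temporal patch aggregation, not the cited method.
    """
    p = patch
    cy, cx = ref.shape[0] // 2, ref.shape[1] // 2
    query = ref[cy:cy + p, cx:cx + p]
    candidates = []
    for frame in neighbors:
        H, W = frame.shape
        for y in range(H - p + 1):          # exhaustive search for clarity
            for x in range(W - p + 1):
                c = frame[y:y + p, x:x + p]
                candidates.append((np.sum((c - query) ** 2), c))
    candidates.sort(key=lambda t: t[0])     # most similar first
    best = [c for _, c in candidates[:k]]
    return np.mean(best, axis=0)
```

If a neighboring frame contains an exact copy of the query patch, the k=1 aggregate recovers it exactly; with larger k, averaging correlated patches suppresses independent noise, which is the motivation for multi-correspondence aggregation.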
Object skeletonization in a single natural image is a challenging problem because there is hardly any prior knowledge about the object.
In this paper, we propose an efficient local-to-global method to identify the background, based on the assumption that, given sufficient camera motion, the cumulative background features account for the largest number of trajectories.
In this paper, a novel deep-learning-based framework is proposed to infer 3D human poses from a single image.
Although significant advances have been made in human pose estimation from images using deep Convolutional Neural Networks (ConvNets), 3D pose inference in the wild remains a major challenge.
Using efficient but robust registration enables us to combine multiple frames of a scene in near real time and generate 3D bounding boxes for potential 3D regions of interest.
The choice of motion models is vital in applications like image/video stitching and video stabilization.
However, the quality of the PMBP solution is tightly coupled with the local window size, over which the raw data cost is aggregated to mitigate ambiguity in the data constraint.
Most conventional structure-from-motion (SFM) techniques require camera pose estimation before computing any scene structure.
In this paper, we propose a unified framework called PISA, which stands for Pixelwise Image Saliency Aggregating various bottom-up cues and priors.
Most existing works rely heavily on object/part detectors to build correspondences between object parts, using object or object-part annotations in the training images.
Fundamental challenges to such an image or scene alignment task are often multifold, causing many existing techniques to fall short of producing dense correspondences robustly and efficiently.
By fusing complementary contrast measures in such a pixelwise adaptive manner, the detection effectiveness is significantly boosted.
Recent studies on fast cost volume filtering based on efficient edge-aware filters have provided a fast alternative to solve discrete labeling problems, with the complexity independent of the support window size.