We propose a deep convolutional neural network (CNN) to estimate surface normals from a single color image accompanied by a low-quality depth channel.
Experiments show that the perceptual representation in the human visual system (HVS) is an effective way of predicting subjective temporal quality; thus TPQI can, for the first time, achieve performance comparable to spatial quality metrics and is even more effective in assessing videos with large temporal variations.
Consisting of fragments and FANet, the proposed FrAgment Sample Transformer for VQA (FAST-VQA) enables efficient end-to-end deep VQA and learns effective video-quality-related representations.
Ranked #1 on Video Quality Assessment on YouTube-UGC (using extra training data)
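To make the fragment idea concrete, here is a minimal sketch of grid mini-patch sampling in PyTorch; the function name, grid layout, and shared random offsets are illustrative assumptions rather than FAST-VQA's exact implementation.

```python
import torch

def sample_fragments(video, grid=7, patch=32):
    """Grid mini-patch ("fragment") sampling sketch. Assumes `video`
    is (T, C, H, W) and each grid cell is at least `patch` pixels."""
    t, c, h, w = video.shape
    cell_h, cell_w = h // grid, w // grid
    out = torch.empty(t, c, grid * patch, grid * patch)
    for i in range(grid):
        for j in range(grid):
            # one random offset per cell, shared by all frames so the
            # fragments stay temporally aligned
            y = i * cell_h + torch.randint(0, max(cell_h - patch, 1), (1,)).item()
            x = j * cell_w + torch.randint(0, max(cell_w - patch, 1), (1,)).item()
            out[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                video[:, :, y:y + patch, x:x + patch]
    return out
```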
Based on the prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues.
In contrast, the soft composition operates by stitching different patches into a whole feature map, where pixels in overlapping regions are summed.
Ranked #2 on Video Inpainting on DAVIS
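As a rough illustration of what summing overlapping patches means, consider this sketch; the list-of-patches interface and (y, x) corner convention are assumptions, not the paper's API.

```python
import torch

def soft_compose(patches, positions, out_h, out_w):
    """Paste feature patches onto one canvas; overlapping pixels are
    summed rather than overwritten."""
    canvas = torch.zeros(patches[0].shape[0], out_h, out_w)
    for p, (y, x) in zip(patches, positions):
        ph, pw = p.shape[1:]
        canvas[:, y:y + ph, x:x + pw] += p  # sum in overlap regions
    return canvas
```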
We present a novel approach to reference-based super-resolution (RefSR) with the focus on dual-camera super-resolution (DCSR), which utilizes reference images for high-quality and high-fidelity results.
Due to the lack of a large-scale reflection removal dataset with diverse real-world scenes, many existing reflection removal methods are trained on synthetic data plus a small amount of real-world data, which makes it difficult to thoroughly evaluate the strengths and weaknesses of different methods.
The seamless combination of these two novel designs forms a better spatio-temporal attention scheme, and our proposed model outperforms state-of-the-art video inpainting approaches with significantly boosted efficiency.
In the animation industry, cartoon videos are usually produced at a low frame rate, since drawing such frames by hand is costly and time-consuming.
For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame.
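A minimal sketch of this flow-based propagation step, assuming axis-aligned boxes and a dense forward-flow tensor (both illustrative choices):

```python
import torch

def propagate_regions(boxes, flow):
    """Shift the previous frame's query boxes into the current frame
    using the flow sampled at each box center. Assumes (x1, y1, x2, y2)
    rows and a (2, H, W) forward-flow tensor."""
    moved = boxes.clone().float()
    for k, (x1, y1, x2, y2) in enumerate(boxes):
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        dx, dy = flow[0, cy, cx], flow[1, cy, cx]
        moved[k] = torch.stack([x1 + dx, y1 + dy, x2 + dx, y2 + dy])
    return moved
```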
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch, which can simultaneously maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity on specifically designed GPUs.
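For reference, N:M sparsity keeps N nonzero weights in every group of M consecutive weights; a minimal magnitude-based mask, with the group axis chosen arbitrarily for illustration, looks like this:

```python
import torch

def nm_prune_mask(weight, n=2, m=4):
    """Keep the N largest-magnitude weights in every group of M along
    the input dimension (e.g. 2:4)."""
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // m, m)
    idx = groups.abs().topk(n, dim=-1).indices  # top-N per group of M
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, idx, 1.0)
    return mask.reshape(out_f, in_f)

w = torch.randn(8, 16)
sparse_w = w * nm_prune_mask(w)  # 2 of every 4 entries survive
```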
In this paper, we study the problem of real-scene single image super-resolution to bridge the gap between synthetic data and real captured images.
Though synthetic datasets have been proposed to fill the gap of large data demand, fine-tuning on real datasets is still needed due to the domain variance between synthetic and real data.
We present a spatial pixel aggregation network that learns pixel sampling and averaging strategies for image denoising.
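One way to picture learned sampling-and-averaging is a kernel-prediction-style aggregation; this sketch assumes the weights come from a separate prediction branch and is not the paper's exact operator:

```python
import torch
import torch.nn.functional as F

def aggregate_pixels(noisy, weights, k=3):
    """Each output pixel is a weighted average of its k x k
    neighbourhood, with per-pixel weights of shape (B, k*k, H, W)."""
    b, c, h, w = noisy.shape
    patches = F.unfold(noisy, k, padding=k // 2)      # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    w_norm = F.softmax(weights, dim=1).unsqueeze(1)   # normalize over k*k
    return (patches * w_norm).sum(dim=2)              # (B, C, H, W)
```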
In this work, we further improve the performance of QVI from three facets and propose an enhanced quadratic video interpolation (EQVI) model.
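The quadratic model underlying QVI-style interpolation fits a per-pixel constant acceleration from two estimated flows; a sketch, with illustrative tensor names:

```python
import torch

def quadratic_flow(f_0_1, f_0_m1, t):
    """From the flows frame0->frame1 (f_0_1) and frame0->frame-1
    (f_0_m1), assume constant per-pixel acceleration so that
    f_0_t = 0.5*a*t**2 + v*t. Reduces to linear interpolation
    when f_0_m1 == -f_0_1."""
    half_a = (f_0_1 + f_0_m1) / 2.0   # 0.5 * acceleration
    v = (f_0_1 - f_0_m1) / 2.0        # velocity
    return half_a * t ** 2 + v * t
```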
By incorporating the depth map, our approach is able to extrapolate realistic high-frequency effects under novel lighting via geometry-guided image decomposition of the flashlight image, and to predict the cast shadow map from the shadow-encoding transformed depth map.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
Ranked #1 on 3D Object Reconstruction on Data3D-R2N2
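The selection step can be pictured as a per-voxel softmax over candidate volumes; this sketch assumes scores produced by a context-aware branch and omits the multi-scale part:

```python
import torch

def fuse_volumes(volumes, scores):
    """Blend K candidate reconstructions (K, D, H, W) with per-voxel
    selection scores of the same shape."""
    weights = torch.softmax(scores, dim=0)  # candidates compete per voxel
    return (weights * volumes).sum(dim=0)   # fused (D, H, W) volume
```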
In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information.
Ranked #1 on Point Cloud Completion on Completion3D
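A Gridding-style layer can be sketched as a differentiable trilinear scatter of points onto grid vertices; the weighting below is illustrative and may differ from the paper's exact definition:

```python
import torch

def gridding(points, res=32):
    """Scatter each point onto its eight surrounding vertices with
    trilinear weights, keeping structure and gradients w.r.t. the
    coordinates. `points` is (N, 3) in [-1, 1]."""
    coords = (points + 1.0) / 2.0 * (res - 1)
    base = coords.floor().long().clamp(0, res - 2)
    frac = coords - base.float()
    grid = torch.zeros(res, res, res)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                     * (frac[:, 1] if dy else 1 - frac[:, 1])
                     * (frac[:, 2] if dz else 1 - frac[:, 2]))
                idx = base + torch.tensor([dx, dy, dz])
                grid.index_put_((idx[:, 0], idx[:, 1], idx[:, 2]),
                                w, accumulate=True)
    return grid
```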
Large-scale synthetic datasets are beneficial to stereo matching but usually introduce a known domain bias.
We present a novel formulation for removing reflections from polarized images in the wild.
Inferring the 3D shape of an object from an RGB image has shown impressive results; however, existing methods rely primarily on recognizing the most similar 3D model from the training set to solve the problem.
Recently, it has become increasingly popular to equip mobile RGB cameras with Time-of-Flight (ToF) sensors for active depth sensing.
Most existing super-resolution methods do not perform well in real scenarios due to the lack of realistic training data and the information loss in the model input.
The growing availability of commodity RGB-D cameras has boosted applications in the field of scene understanding.
Our method is evaluated on both realistic and synthetic stereo image pairs, and produces superior results compared to calibrated rectification or other self-rectification approaches.
Intuitively, the variance of the Laplacian distribution is large for low-confidence pixels and small for high-confidence pixels.
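This intuition is commonly trained with a Laplacian negative log-likelihood; a sketch with illustrative names (the variance of a Laplacian with scale b is 2*b**2):

```python
import torch

def laplacian_nll(pred, target, log_b):
    """Minimizing |target - pred| / b + log(b) (the constant log 2 is
    dropped) pushes the scale b, and hence the variance 2*b**2, up on
    unreliable pixels and down on confident ones."""
    b = log_b.exp()
    return (torch.abs(target - pred) / b + log_b).mean()
```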
While significant progress has been made in monocular depth estimation with Convolutional Neural Networks (CNNs) extracting absolute features, such as edges and textures, the depth constraint of neighboring pixels, namely relative features, has been mostly ignored by recent methods.
In this work, we combine the robustness of model-based approaches with the learning power of data-driven approaches for real image denoising.
By feeding real stereo pairs of different domains to stereo models pre-trained with synthetic data, we see that: i) a pre-trained model does not generalize well to the new domain, producing artifacts at boundaries and ill-posed regions; however, ii) feeding an up-sampled stereo pair leads to a disparity map with extra details.
The resulting model outperforms all previous monocular depth estimation methods, as well as the stereo block matching method, on the challenging KITTI dataset while using only a small amount of real training data.
Ranked #18 on Monocular Depth Estimation on KITTI Eigen split (using extra training data)
Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency, to handle severe image quality degradation (e.g., motion blur and occlusion), and to capture the temporal correlation among video frames.
Ranked #2 on Pose Estimation on J-HMDB
In this paper, we introduce a bilinear composition loss function to address the problem of image dehazing.
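The atmospheric scattering model I = J*t + A*(1 - t) is bilinear in the scene radiance J and the transmission t; a composition-style reconstruction loss in that spirit (the exact terms and weighting are assumptions) can be sketched as:

```python
import torch

def composition_loss(hazy, dehazed, transmission, airlight):
    """Recompose the hazy input from the predicted radiance and
    transmission via the scattering model, then penalize the
    reconstruction error."""
    recomposed = dehazed * transmission + airlight * (1.0 - transmission)
    return torch.mean((recomposed - hazy) ** 2)
```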
As opposed to directly learning the disparity at the second stage, we show that residual learning provides more effective refinement.
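A minimal picture of second-stage residual learning: the refiner outputs a correction added to the initial disparity rather than a disparity from scratch (the tiny conv stack is a placeholder, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    """Predict a correction that is added to the first-stage disparity."""
    def __init__(self, in_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, image, init_disp):
        x = torch.cat([image, init_disp], dim=1)  # (B, 3+1, H, W)
        return init_disp + self.net(x)            # residual refinement
```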
Recent advances in visual tracking showed that deep Convolutional Neural Networks (CNN) trained for image classification can be strong feature extractors for discriminative trackers.
In this paper, we propose a novel single-stage, end-to-end trainable object detection network to overcome this limitation.
This task requires not only collective perception of both visual and auditory signals, but also robustness to severe quality degradations and unconstrained content variations.
In this paper, we draw on Shepard interpolation and design Shepard Convolutional Neural Networks (ShCNN), which efficiently realize end-to-end trainable translation variant interpolation (TVI) operators in the network.
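Shepard interpolation is an inverse-distance-weighted average over known pixels; one common convolutional realization divides a masked convolution by a convolved mask, sketched below (sharing one kernel across channels is a simplification):

```python
import torch
import torch.nn.functional as F

def shepard_layer(image, mask, kernel):
    """Convolve the masked image and the mask with the same non-negative
    kernel, then divide, so known pixels are weight-averaged into the
    unknown ones. `kernel` is a (1, 1, k, k) tensor."""
    c, pad = image.shape[1], kernel.shape[-1] // 2
    k = kernel.expand(c, 1, *kernel.shape[2:]).contiguous()  # depthwise
    num = F.conv2d(image * mask, k, padding=pad, groups=c)
    den = F.conv2d(mask.expand_as(image), k, padding=pad, groups=c)
    return num / (den + 1e-8)
```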