Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input.
We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE).
Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity.
Text-to-image diffusion models have made significant advances in generating and editing high-quality images.
We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation.
A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding.
Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars.
Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric.
To the best of our knowledge, our work is the first mobile solution for face motion deblurring that works reliably and robustly over thousands of images in diverse motion and lighting conditions.
Our method works on in-the-wild online image collections of an object and produces relightable 3D assets for several use cases such as AR/VR.
Transformers have been widely used in numerous vision problems, especially visual recognition and detection.
While recent face anti-spoofing methods perform well under intra-domain setups, a robust approach must account for the much larger appearance variations of images acquired in complex scenes with different sensors.
Our newly trained RAFT achieves an Fl-all score of 4.31% on KITTI 2015, more accurate than all published optical flow methods at the time of writing.
Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
Recent methods use multiple networks to estimate optical flow or depth and a separate network dedicated to frame synthesis.
Ranked #2 on Video Frame Interpolation on Xiph-4k
Image super-resolution (SR) is a fast-moving field with novel architectures attracting the spotlight.
The surface embeddings are implemented as coordinate-based MLPs that are fit to each video via consistency and contrastive reconstruction losses. Experimental results show that ViSER compares favorably against prior work on challenging videos of humans with loose clothing and unusual poses, as well as animal videos from DAVIS and YTVOS.
In this work, we present pyramid adversarial training (PyramidAT), a simple and effective technique to improve ViT's overall performance.
Ranked #6 on Domain Generalization on ImageNet-C (using extra training data)
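A minimal sketch of the multi-scale perturbation idea, assuming PyTorch; the function name, pyramid levels, and update rule are illustrative rather than the paper's exact parameterization:

```python
import torch
import torch.nn.functional as F

def apply_pyramid_perturbation(image, deltas):
    """Add adversarial noise defined at several resolutions (e.g. full,
    1/4, and 1/16 scale), upsampling each level to full size before summing.
    image: (B, 3, H, W) in [0, 1]; deltas: list of per-level tensors."""
    _, _, H, W = image.shape
    noise = torch.zeros_like(image)
    for delta in deltas:  # one tensor per pyramid level
        noise = noise + F.interpolate(delta, size=(H, W), mode="bilinear",
                                      align_corners=False)
    return (image + noise).clamp(0.0, 1.0)
```

In such a scheme the per-level deltas are updated by gradient ascent on the training loss while the model weights take the usual descent step, so the network is exposed to structured perturbations at multiple spatial frequencies.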
Transformers are transforming the landscape of computer vision, especially for recognition tasks.
Ranked #11 on Object Detection on COCO 2017 val
Remarkable progress has been made in 3D reconstruction of rigid structures from a video or a collection of images.
Synthetic datasets play a critical role in pre-training CNN models for optical flow, but they are painstaking to generate and hard to adapt to new applications.
By integrating SGC and GPA, we propose the Adaptive Superpixel-guided Network (ASGNet), a lightweight model that adapts to variations in object scale and shape.
Ranked #55 on Few-Shot Semantic Segmentation on COCO-20i (5-shot)
In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses.
End-to-end deep learning methods have advanced stereo vision in recent years and obtained excellent results when the training and test data are similar.
Cost volume is an essential component of recent deep models for optical flow estimation and is usually constructed by calculating the inner product between two feature vectors.
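As a concrete illustration, here is a minimal sketch of such an inner-product cost volume over a local search window, assuming PyTorch feature maps (the function name and window radius are illustrative):

```python
import torch
import torch.nn.functional as F

def cost_volume(feat1, feat2, max_disp=4):
    """Correlation cost volume: inner products between each feature vector
    in feat1 and the feat2 vectors within a (2*max_disp+1)^2 search window.
    feat1, feat2: (B, C, H, W) feature maps of two frames."""
    B, C, H, W = feat1.shape
    pad = max_disp
    feat2_pad = F.pad(feat2, (pad, pad, pad, pad))
    volumes = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat2_pad[:, :, dy:dy + H, dx:dx + W]
            # normalize the inner product by the feature dimension
            volumes.append((feat1 * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(volumes, dim=1)  # (B, (2*max_disp+1)^2, H, W)
```

Each output channel holds the correlation with one candidate displacement, which downstream layers interpret as a matching cost.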
Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition.
We introduce a compact network for holistic scene flow estimation, called SENSE, which shares common encoder features among four closely-related tasks: optical flow estimation, disparity estimation from stereo, occlusion estimation, and semantic segmentation.
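A schematic sketch of the shared-encoder idea, with illustrative layer sizes rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One encoder shared across four task decoders (optical flow,
    disparity, occlusion, semantic segmentation); a simplified sketch
    of the SENSE-style design, not the published network."""
    def __init__(self, feat_dim=64, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.flow_head = nn.Conv2d(feat_dim, 2, 3, padding=1)
        self.disp_head = nn.Conv2d(feat_dim, 1, 3, padding=1)
        self.occ_head = nn.Conv2d(feat_dim, 1, 3, padding=1)
        self.seg_head = nn.Conv2d(feat_dim, num_classes, 3, padding=1)

    def forward(self, image):
        f = self.encoder(image)  # features shared by all four tasks
        return (self.flow_head(f), self.disp_head(f),
                self.occ_head(f), self.seg_head(f))
```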
We present a simple and effective image super-resolution algorithm that imposes an image formation constraint on the deep neural networks via pixel substitution.
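One common way to impose such a constraint is back-projection-style consistency with the low-resolution observation; the sketch below assumes bicubic downsampling and is not necessarily the paper's exact pixel-substitution scheme:

```python
import torch
import torch.nn.functional as F

def enforce_formation_constraint(sr, lr, scale=4):
    """Make the super-resolved output consistent with the LR observation:
    downsample the network output, compute its disagreement with the actual
    LR pixels, and add the upsampled correction back.
    sr: (B, 3, scale*H, scale*W); lr: (B, 3, H, W)."""
    sr_down = F.interpolate(sr, scale_factor=1.0 / scale, mode="bicubic",
                            align_corners=False)
    residual = lr - sr_down  # disagreement with the observed LR image
    correction = F.interpolate(residual, scale_factor=scale, mode="bicubic",
                               align_corners=False)
    return sr + correction
```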
Despite the long history of image and video stitching research, existing academic and commercial solutions still produce strong artifacts.
We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model.
Ranked #1 on Video Frame Interpolation on UCF101 (PSNR (sRGB) metric)
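A minimal sketch of such a pseudo supervised term, assuming a frozen pre-trained interpolation network (names are hypothetical):

```python
import torch
import torch.nn.functional as F

def pseudo_supervised_loss(student_frame, teacher_model, frame0, frame1, t=0.5):
    """Penalize disagreement between the interpolated frame produced by the
    model being trained and the prediction of a frozen, pre-trained
    interpolation model on the same input pair."""
    with torch.no_grad():
        teacher_frame = teacher_model(frame0, frame1, t)
    return F.l1_loss(student_frame, teacher_frame)
```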
In addition, we demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements.
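A simplified sketch of a pixel-adaptive convolution, where a fixed kernel is modulated per pixel by a Gaussian of guidance-feature differences (the Gaussian form follows the common formulation; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def pixel_adaptive_conv(x, guide, weight):
    """x: (B, Cin, H, W); guide: (B, Cg, H, W); weight: (Cout, Cin, k, k).
    The spatially invariant kernel is reweighted at each pixel by
    K_ij = exp(-0.5 * ||g_i - g_j||^2) over its k x k neighborhood."""
    B, Cin, H, W = x.shape
    Cout, _, k, _ = weight.shape
    pad = k // 2
    x_patches = F.unfold(x, k, padding=pad).view(B, Cin, k * k, H * W)
    g_patches = F.unfold(guide, k, padding=pad).view(B, guide.shape[1],
                                                     k * k, H * W)
    g_center = guide.reshape(B, guide.shape[1], 1, H * W)
    adapt = torch.exp(-0.5 * ((g_patches - g_center) ** 2).sum(1, keepdim=True))
    out = torch.einsum('ocn,bcnp->bop', weight.view(Cout, Cin, k * k),
                       x_patches * adapt)
    return out.view(B, Cout, H, W)
```

With a constant guidance map the modulation is 1 everywhere (up to boundary effects), so the operation reduces to a standard convolution, which is what makes it usable as a drop-in replacement.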
To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account.
We investigate two crucial and closely related aspects of CNNs for optical flow estimation: models and training.
Ranked #5 on Optical Flow Estimation on KITTI 2012
Specifically, we first exploit Convolutional Neural Networks to estimate the relative depth and portrait segmentation maps from a single input image.
We propose a simple and effective discriminative framework to learn data terms that can adaptively handle blurred images in the presence of severe noise and outliers.
Superpixels provide an efficient low/mid-level representation of image data, which greatly reduces the number of image primitives for subsequent vision tasks.
Specifically, we propose a new loss function that takes the segmentation error into account for affinity learning.
We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions.
Ranked #52 on Monocular Depth Estimation on KITTI Eigen split
These problems usually involve the estimation of two components of the target signals: structures and details.
Estimation of 3D motion in a dynamic scene from a temporal pair of images is a core task in many scene understanding problems.
We present a network architecture for processing point clouds that directly operates on a collection of points represented as a sparse set of samples in a high-dimensional lattice.
Ranked #25 on Semantic Segmentation on ScanNet (test mIoU metric)
Specifically, we target a streaming setting where the videos to be streamed from a server to a client are all in the same domain and must be compressed to a small size for low-latency transmission.
Finally, the two input images are warped and linearly fused to form each intermediate frame.
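A minimal sketch of the fusion step, assuming the two inputs have already been warped to the target time t (weighting by temporal distance favors the closer frame):

```python
def fuse_intermediate(warped0, warped1, t=0.5):
    """Linearly blend frame 0 and frame 1 after each has been warped to
    time t in (0, 1); t = 0.5 gives the midpoint frame."""
    return (1.0 - t) * warped0 + t * warped1
```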
We present an algorithm to directly restore a clear high-resolution image from a blurry low-resolution input.
It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow.
Ranked #3 on Dense Pixel Correspondence Estimation on HPatches
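A minimal sketch of the backward-warping step, assuming PyTorch's grid_sample and a flow field in pixel units (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Backward-warp feat (B, C, H, W) by flow (B, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement
    in pixels."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=feat.dtype, device=feat.device),
        torch.arange(W, dtype=feat.dtype, device=feat.device),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid_x = 2.0 * grid_x / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```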
Given two consecutive frames from a pair of stereo cameras, 3D scene flow methods simultaneously estimate the 3D geometry and motion of the observed scene.
Therefore, enforcing the sparsity of the dark channel helps blind deblurring in various scenarios, including natural, face, text, and low-illumination images.
Ranked #7 on Deblurring on RealBlur-R (trained on GoPro)
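A minimal sketch of the dark channel computation in PyTorch (the patch size is illustrative). Blur spreads intensity, so the dark channel of a blurred image has fewer zeros; this is why a sparsity (L1) penalty on it favors sharp latent images:

```python
import torch
import torch.nn.functional as F

def dark_channel(img, patch=35):
    """Dark channel: per-pixel minimum over the RGB channels and a local
    patch. img: (B, 3, H, W) with values in [0, 1]."""
    chan_min = img.min(dim=1, keepdim=True).values  # (B, 1, H, W)
    pad = patch // 2
    # min-pooling implemented as negated max-pooling of the negated map
    return -F.max_pool2d(-chan_min, kernel_size=patch, stride=1, padding=pad)

# illustrative sparsity prior for a blind-deblurring objective:
# loss_dark = dark_channel(latent_estimate).abs().mean()
```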
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow.
As consumer depth sensors become widely available, estimating scene flow from RGBD sequences has received increasing attention.
To handle such situations, we propose a local layering model where motion and occlusion relationships are inferred jointly.
Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another.
We present a new probabilistic model of optical flow in layers that addresses many of the shortcomings of previous approaches.