We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image.
The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos.
Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions resulting in occlusions and significant changes in perspective that leave little overlap between images.
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting, where the images have little or no overlap.
We cast this as the problem of aligning a source 3D object to a target 3D object from the same object category.
Recent works have shown exciting results in unsupervised image de-rendering -- learning to decompose 3D shape, appearance, and lighting from single-image collections without explicit supervision.
We present PhySG, an end-to-end inverse rendering pipeline that includes a fully differentiable renderer and can reconstruct geometry, materials, and illumination from scratch from a set of RGB input images.
We present a framework for automatically reconfiguring images of street scenes by populating, depopulating, or repopulating them with objects such as pedestrians or vehicles.
Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes.
We introduce the problem of perpetual view generation - long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image.
Important ethical concerns arising from computer vision datasets of people have been receiving significant attention, and a number of datasets have been withdrawn as a result.
We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input.
We consider two important aspects in understanding and editing images: modeling regular, program-like texture or patterns in 2D planes, and 3D posing of these planes in the scene.
Neural Radiance Fields (NeRF) achieve impressive view synthesis results for a variety of capture settings, including 360° capture of bounded scenes and forward-facing capture of bounded and unbounded scenes.
Predicting where people can walk in a scene is important for many tasks, including autonomous driving systems and human behavior analysis.
Representing a shape as a distribution of 3D points, point cloud generation amounts to moving randomly sampled points toward high-density areas.
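As a rough illustration of that idea (a minimal sketch, not the paper's actual model), the snippet below drifts randomly sampled points toward high-density regions with a Langevin-style update; grad_fn is a hypothetical stand-in for a learned gradient field of the log-density.

```python
import numpy as np

def move_points_to_high_density(points, grad_fn, step=1e-3, noise=1e-2, n_steps=100):
    """Drift sampled points toward high-density regions (Langevin-style update).

    grad_fn is a hypothetical callable returning an estimate of the gradient of
    the log-density at each point, standing in for a learned gradient field.
    """
    pts = np.asarray(points, dtype=float).copy()
    for _ in range(n_steps):
        pts = pts + step * grad_fn(pts) + noise * np.random.randn(*pts.shape)
    return pts
```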
We propose a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors.
Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$.
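For concreteness, here is a minimal NumPy sketch of the standard SVD projection onto $SO(n)$, with the usual determinant correction so the result is a proper rotation; this illustrates the general technique, not any particular paper's implementation.

```python
import numpy as np

def project_to_SOn(M):
    """Project a square matrix M onto SO(n) via SVD (symmetric orthogonalization
    with a sign flip on the last singular vector so the determinant is +1)."""
    U, _, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))            # +1 or -1
    D = np.diag([1.0] * (M.shape[0] - 1) + [d])   # correct an improper rotation
    return U @ D @ Vt

# Example: snap a noisy 3x3 matrix to the nearest rotation.
R = project_to_SOn(np.eye(3) + 0.1 * np.random.randn(3, 3))
```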
Neural implicit shape representations are an emerging paradigm that offers many potential benefits over conventional discrete representations, including memory efficiency at a high spatial resolution.
Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks.
Appropriate and timely deployment of disease management depends on early disease detection.
A recent strand of work in view synthesis uses deep learning to generate multiplane images (a camera-centric, layered 3D representation) given two or more input images at known viewpoints.
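To make the layered representation concrete, here is a hedged sketch of how an MPI is typically rendered: each plane stores RGB and alpha, and the planes are over-composited from back to front (the array shapes below are illustrative assumptions, not the papers' exact conventions).

```python
import numpy as np

def composite_mpi(rgb, alpha):
    """Over-composite MPI planes from back to front.

    rgb:   (D, H, W, 3) per-plane colors, ordered back (index 0) to front.
    alpha: (D, H, W, 1) per-plane opacities in [0, 1].
    Returns an (H, W, 3) rendered image.
    """
    out = np.zeros(rgb.shape[1:], dtype=float)
    for c, a in zip(rgb, alpha):      # repeated "over" operator
        out = c * a + out * (1.0 - a)
    return out
```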
We present a deep learning solution for estimating the incident illumination at any 3D location within a scene from an input narrow-baseline stereo image pair.
We introduce UprightNet, a learning-based approach for estimating 2DoF camera orientation from a single RGB image of an indoor scene.
We present a theoretical analysis showing that the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency. We also present a novel MPI prediction procedure that theoretically enables view extrapolations of up to $4\times$ the lateral viewpoint movement allowed by prior work.
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving.
Starting from internet photos of a tourist landmark, we apply traditional 3D reconstruction to register the photos and approximate the scene as a point cloud.
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task.
Intrinsic image decomposition is a challenging, long-standing computer vision problem for which ground truth data is very difficult to acquire.
We demonstrate this framework on 3D pose estimation by proposing a differentiable objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object.
The view synthesis problem--generating novel views of a scene from known imagery--has garnered recent attention due in part to compelling applications in virtual and augmented reality.
We validate the use of large amounts of Internet data by showing that models trained on MegaDepth exhibit strong generalization, not only to novel scenes but also to other diverse datasets including Make3D, KITTI, and DIW, even when no images from those datasets are seen during training.
We demonstrate the value of our data and network in an application to intrinsic images, where we can reduce decomposition artifacts produced by existing algorithms.
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences.
We propose Deep Feature Interpolation (DFI), a new data-driven baseline for automatic high-resolution image transformation.
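A minimal sketch of the core interpolation step in that spirit (feature extraction and image reconstruction are omitted; the phi_* inputs are assumed to be deep features from a pretrained network, and the names are hypothetical):

```python
import numpy as np

def interpolate_deep_features(phi_x, phi_pos, phi_neg, alpha=1.0):
    """Shift an image's deep features along an attribute direction.

    phi_x:   (d,) deep features of the input image.
    phi_pos: (n, d) features of example images that have the target attribute.
    phi_neg: (m, d) features of example images that lack it.
    The edited image would be recovered separately by inverting the shifted
    feature vector back to pixels (not shown here).
    """
    w = phi_pos.mean(axis=0) - phi_neg.mean(axis=0)   # attribute direction
    w = w / np.linalg.norm(w)                         # normalize
    return phi_x + alpha * w
```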
We propose a new neural network architecture for solving single-image analogies - the generation of an entire set of stylistically similar images from just a single input image.
We propose a new method for turning an Internet-scale corpus of categorized images into a small set of human-interpretable discriminative visual elements using powerful tools based on deep learning.
To our knowledge, our work is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
In this paper, we introduce a new, large-scale, open dataset of materials in the wild, the Materials in Context Database (MINC), and combine this dataset with deep learning to achieve material recognition and segmentation of images in the wild.