In applications such as optical see-through and projector augmented reality, producing images amounts to solving non-negative image generation, where one can only add light to an existing image.
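A minimal numpy sketch of this constraint: the displayed image is the background plus a strictly non-negative light layer, so target pixels darker than the background are unreachable. The function and variable names here are illustrative, not the paper's formulation.

```python
import numpy as np

def render_additive(background, target):
    """Non-negative image generation: we may only ADD light to the
    existing background, i.e. display background + delta with
    delta >= 0 (pixel values in [0, 1])."""
    # The best achievable additive layer never subtracts light:
    delta = np.clip(target - background, 0.0, None)
    return np.clip(background + delta, 0.0, 1.0)

background = np.array([0.2, 0.8])
target = np.array([0.5, 0.3])  # second pixel is darker than the background
result = render_additive(background, target)
# First pixel can reach the target; the second cannot go below 0.8.
```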
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting, where the images have little or no overlap.
With our novel learning objective, our framework can learn high-level semantic concepts.
Self-driving cars must detect other vehicles and pedestrians in 3D to plan safe routes and avoid collisions.
Few-shot learning is based on the premise that labels are expensive, especially when they are fine-grained and require expertise.
Point cloud generation thus amounts to moving randomly sampled points to high-density areas.
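This view can be sketched as gradient ascent on a log-density: randomly sampled points follow the score function into high-density regions. The 1-D Gaussian target and step size below are illustrative assumptions, not the paper's actual learned model.

```python
import numpy as np

def grad_log_density(x, mu=0.0, sigma=1.0):
    # Score function of a 1-D Gaussian: d/dx log N(x; mu, sigma^2)
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
points = rng.uniform(-5, 5, size=100)  # randomly sampled points
for _ in range(200):
    points = points + 0.05 * grad_log_density(points)  # move uphill in density
# points now cluster tightly around the high-density region at mu = 0
```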
Existing approaches to depth or disparity estimation output a distribution over a set of pre-defined discrete values.
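Decoding such a discrete distribution is commonly done with a probability-weighted average over the bin values (a soft-argmax). A minimal sketch, with bin spacing and logits chosen purely for illustration:

```python
import numpy as np

def expected_depth(logits, depth_bins):
    """Soft-argmax decoding: the network outputs logits over a set of
    pre-defined discrete depth values; the continuous estimate is the
    probability-weighted average of the bin centers."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(np.dot(p, depth_bins))

bins = np.linspace(1.0, 10.0, 10)                          # candidate depths (m)
logits = np.array([0, 0, 0, 5.0, 0, 0, 0, 0, 0, 0], float) # mass near 4.0 m
d = expected_depth(logits, bins)  # close to, but not exactly, 4.0
```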
In the domain of autonomous driving, deep learning has substantially improved the 3D object detection accuracy for LiDAR and stereo camera data alike.
Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks.
In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes).
There has been little to no work with these methods on other smaller domains, such as satellite, textural, or biological imagery.
Reliable and accurate 3D object detection is a necessity for safe autonomous driving.
Few-shot, fine-grained classification requires a model to learn subtle, fine-grained distinctions between different classes (e.g., birds) based on a few images alone.
We obtain 2-D segmentation predictions by applying Mask-RCNN to the RGB image, and then link this image to a 3-D lidar point cloud by building a graph of connections among 3-D points and 2-D pixels.
To address this problem, we present a new model architecture that reframes single-view 3D reconstruction as learned, category-agnostic refinement of a provided, category-specific prior.
Specifically, we learn a two-level hierarchy of distributions where the first level is the distribution of shapes and the second level is the distribution of points given a shape.
Ranked #2 on Point Cloud Generation on ShapeNet Car
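The two-level sampling described above can be sketched as: first draw a shape latent, then draw points conditioned on it. The simple Gaussian forms here are placeholders for the paper's learned distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_shape():
    # Level 1: sample a latent "shape" (here just a center and a scale)
    center = rng.normal(0.0, 2.0, size=3)
    scale = rng.uniform(0.1, 0.5)
    return center, scale

def sample_points(shape, n=1024):
    # Level 2: sample a point cloud conditioned on the shape latent
    center, scale = shape
    return center + scale * rng.normal(size=(n, 3))

shape = sample_shape()
cloud = sample_points(shape)  # (1024, 3) points scattered around the center
```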
We present a technique to improve the transferability of deep representations learned on small labeled datasets by introducing self-supervised tasks as auxiliary loss functions.
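One widely used self-supervised auxiliary task is rotation prediction, where each image yields four training examples labelled by the applied rotation; a sketch (the paper's exact auxiliary tasks may differ):

```python
import numpy as np

def rotation_pretext_batch(image):
    """Self-supervised pretext task (rotation prediction): each input
    yields four examples, labelled 0-3 for 0/90/180/270 degrees.
    Predicting the label provides an auxiliary loss that needs no
    human annotation."""
    rotations = [np.rot90(image, k) for k in range(4)]
    labels = np.arange(4)
    return rotations, labels

img = np.arange(9).reshape(3, 3)
views, labels = rotation_pretext_batch(img)
```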
In this paper we provide substantial advances to the pseudo-LiDAR framework through improvements in stereo depth estimation.
Traditional recognition methods typically require large, artificially-balanced training classes, while few-shot learning methods are tested on artificially small ones.
However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference.
This information-analysis process is called abstracting: recognizing similarities and differences across all the garments and collections.
Not all people are equally easy to identify: color statistics might be enough for some cases while others might require careful reasoning about high- and low-level details.
Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views.
This paper considers the problem of inferring image labels from images when only a few annotated examples are available at training time.
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.
Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature.
Feature pyramids are a basic component in recognition systems for detecting objects at different scales.
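Independent of any learned feature pyramid, the underlying multi-scale idea can be illustrated with a plain image pyramid built by repeated 2x average pooling:

```python
import numpy as np

def image_pyramid(img, levels=3):
    """Multi-scale pyramid via repeated 2x average pooling: each level
    halves the resolution, so the same detector window covers objects
    at progressively larger scales."""
    pyr = [img]
    for _ in range(levels - 1):
        a = pyr[-1]
        h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2  # crop to even size
        pooled = a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyr.append(pooled)
    return pyr

pyr = image_pyramid(np.zeros((32, 32)))  # shapes: 32x32, 16x16, 8x8
```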
Existing methods for pixel-wise labelling tasks generally disregard the underlying structure of labellings, often leading to predictions that are visually implausible.
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as feature representation.
Unlike classical semantic segmentation, we require individual object instances.
We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images.
A k-poselet is a deformable part model (DPM) with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations.