Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images.
We also demonstrate that our system can quickly scan and build a model of a novel object, which our method can then immediately use for pose estimation.
Learning to hallucinate additional examples has recently been shown to be a promising direction for few-shot learning tasks.
Visual data in autonomous driving perception, such as camera images and LiDAR point clouds, can be interpreted as a mixture of two aspects: semantic features and geometric structure.
We propose a model-based approach that enables RL agents to effectively explore environments with unknown system dynamics and environment constraints, given only a small budget of constraint violations.
Recent works have proposed to solve this task by augmenting the training data of the few-shot classes using generative models, with the few-shot training samples as seeds.
Motivated by the human ability to solve this task, models have been developed that transfer knowledge from classes with many examples to learn classes with few examples.
We propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints.
We propose a simple, fast, and flexible framework that simultaneously generates semantic and instance masks for panoptic segmentation.
Multi-agent navigation in dynamic environments is of great industrial value when deploying large-scale fleets of robots in real-world applications.
Our first method, which regresses from deep learned features to an isotropic Bingham distribution, gives the best performance for orientation distribution estimation for non-symmetric objects.
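For intuition, here is a minimal NumPy/SciPy sketch of an isotropic Bingham negative log-likelihood on unit quaternions, assuming a density of the form p(q) ∝ exp(κ(mᵀq)²) with mode m and concentration κ; the function and its hyperparameters are illustrative, not the paper's API.

```python
import numpy as np
from scipy.special import hyp1f1

def isotropic_bingham_nll(q, mode, kappa):
    """Negative log-likelihood of a unit quaternion q under an isotropic
    Bingham density p(q) proportional to exp(kappa * (mode^T q)^2) on S^3.

    Normalizer: for q uniform on S^3, (mode^T q)^2 ~ Beta(1/2, 3/2), whose
    exponential moment is the confluent hypergeometric 1F1(1/2; 2; kappa);
    the surface area of S^3 is 2 * pi^2.
    """
    q = q / np.linalg.norm(q)
    mode = mode / np.linalg.norm(mode)
    dot_sq = float(np.dot(mode, q)) ** 2
    log_norm = np.log(2 * np.pi ** 2) + np.log(hyp1f1(0.5, 2.0, kappa))
    return -kappa * dot_sq + log_norm

# Example: an orientation close to the mode has lower NLL than a distant one.
mode = np.array([1.0, 0.0, 0.0, 0.0])
near = np.array([0.99, 0.1, 0.0, 0.0])
far = np.array([0.0, 1.0, 0.0, 0.0])
print(isotropic_bingham_nll(near, mode, kappa=20.0))  # lower NLL
print(isotropic_bingham_nll(far, mode, kappa=20.0))   # higher NLL
```

The density is antipodally symmetric (p(q) = p(-q)), which is what makes the Bingham family a natural fit for quaternion orientation estimates.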
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
Indeed, even the majority of few-shot learning methods rely on a large set of "base classes" for pretraining.
One of the key challenges in the semantic mapping problem in post-disaster environments is how to analyze a large amount of data efficiently with minimal supervision.
Semantic segmentation with Convolutional Neural Networks is a memory-intensive task due to the high spatial resolution of feature maps and output predictions.
One of their remarkable properties is the ability to transfer knowledge from a large source dataset to a (typically smaller) target dataset.
Humans can robustly learn novel visual concepts even when images undergo various deformations and lose certain information.
We demonstrate our ability to learn MVS without 3D supervision using a real dataset, and show that each component of our proposed robust loss results in a significant improvement.
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
Recently, neural networks operating on point clouds have shown superior performance on 3D understanding tasks such as shape classification and part segmentation.
With our design, the network progressively learns features specific to the target domain using annotation from only the source domain.
Shape completion, the problem of estimating the complete geometry of objects from partial observations, lies at the core of many vision and robotics applications.
Often, multiple cameras are used for cross-spectral imaging, thus requiring image alignment or, in a stereo setting, disparity estimation.
Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views.
We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.
We cast this problem as transfer learning, where knowledge from the data-rich classes in the head of the distribution is transferred to the data-poor classes in the tail.
We seek to combine the advantages of RNNs and PSRs by augmenting existing state-of-the-art recurrent neural networks with Predictive-State Decoders (PSDs), which add supervision to the network's internal state representation by training it to predict future observations.
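A minimal PyTorch sketch of this idea, under simplifying assumptions of our own rather than the paper's implementation: a small decoder head maps the RNN's hidden state to a prediction of the next k observations, and its loss is added to the task loss.

```python
import torch
import torch.nn as nn

class RNNWithPSD(nn.Module):
    """GRU whose hidden state is additionally supervised to predict
    the next k observations (a Predictive-State Decoder-style head)."""
    def __init__(self, obs_dim, hidden_dim, out_dim, k=2):
        super().__init__()
        self.k = k
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.task_head = nn.Linear(hidden_dim, out_dim)
        self.psd_head = nn.Linear(hidden_dim, k * obs_dim)  # decodes future obs

    def forward(self, obs):                      # obs: (B, T, obs_dim)
        h, _ = self.rnn(obs)                     # (B, T, hidden_dim)
        return self.task_head(h), self.psd_head(h)

def psd_loss(psd_pred, obs, k):
    """Match the decoded state at time t against observations t+1..t+k."""
    B, T, D = obs.shape
    losses = []
    for t in range(T - k):
        target = obs[:, t + 1 : t + 1 + k, :].reshape(B, k * D)
        losses.append(((psd_pred[:, t, :] - target) ** 2).mean())
    return torch.stack(losses).mean()

# Usage: total loss = task loss + lambda * PSD loss.
model = RNNWithPSD(obs_dim=8, hidden_dim=32, out_dim=4, k=2)
obs = torch.randn(16, 10, 8)
task_pred, psd_pred = model(obs)
loss = task_pred.pow(2).mean() + 0.5 * psd_loss(psd_pred, obs, k=2)  # dummy task loss
loss.backward()
```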
We demonstrate that this method is able to remove uninformative parts of the feature space for the anomaly detection setting.
Experimentally, the adaptive weights induce more competitive anytime predictions on multiple recognition datasets and models than non-adaptive approaches, including weighting all losses equally.
First, we explicitly model the high-level structure of the active objects in the scene, namely humans, and use a VAE to model their possible future movements in pose space.
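A hedged sketch of such a model, assuming flattened pose vectors and a standard conditional-VAE objective; the dimensions and loss weights are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class PoseFutureVAE(nn.Module):
    """Minimal conditional VAE: encode (past, future) poses into a latent z,
    then decode z conditioned on the past to reconstruct the future pose."""
    def __init__(self, pose_dim=34, z_dim=16, hid=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * pose_dim, hid), nn.ReLU())
        self.to_mu = nn.Linear(hid, z_dim)
        self.to_logvar = nn.Linear(hid, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + pose_dim, hid), nn.ReLU(), nn.Linear(hid, pose_dim))

    def forward(self, past, future):
        h = self.enc(torch.cat([past, future], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(torch.cat([z, past], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return ((recon - future) ** 2).mean() + 1e-3 * kl

# At test time, decoding multiple z ~ N(0, I) yields diverse future poses.
model = PoseFutureVAE()
loss = model(torch.randn(8, 34), torch.randn(8, 34))
loss.backward()
```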
To generalize from the batch to the online setting, we first introduce a definition of the online weak learning edge; with it, for strongly convex and smooth loss functions, we present an algorithm, Streaming Gradient Boosting (SGB), whose error guarantees shrink exponentially in the number of weak learners.
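As a rough illustration only (the actual algorithm, edge definition, and guarantees are in the paper), the sketch below runs gradient-boosting-style updates on a stream: each linear weak learner is fit by SGD to the residual of the partial ensemble before it, and contributions are combined with shrinkage.

```python
import numpy as np

class StreamingGradientBoosting:
    """Sketch of online gradient boosting with linear weak learners under
    squared loss. Hyperparameters (step size, shrinkage) are illustrative."""
    def __init__(self, n_learners=10, dim=5, shrinkage=0.5, lr=0.01):
        self.W = np.zeros((n_learners, dim))
        self.eta, self.lr = shrinkage, lr

    def predict(self, x):
        return self.eta * (self.W @ x).sum()  # shrunken sum of weak learners

    def update(self, x, y):
        """One streaming example (x, y)."""
        partial = 0.0
        for i in range(len(self.W)):
            residual = y - partial            # negative gradient of 0.5*(y-F)^2
            pred_i = self.W[i] @ x
            self.W[i] += self.lr * (residual - pred_i) * x  # SGD toward residual
            partial += self.eta * pred_i

# Usage on a synthetic stream.
rng = np.random.default_rng(0)
sgb, w_true = StreamingGradientBoosting(), rng.normal(size=5)
for _ in range(5000):
    x = rng.normal(size=5)
    sgb.update(x, w_true @ x)
```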
Given a query image, a second, positive image, and a third, negative image that is dissimilar to the first two, we define a contextualized similarity search criterion.
The recently introduced rational camera model provides a general methodology for studying abstract nonlinear imaging systems and their multi-view geometry.
Inspired by the transferability properties of CNNs, we introduce an additional unsupervised meta-training stage that exposes multiple top layer units to a large amount of unlabeled real-world images.
We address an anomaly detection setting in which training sequences are unavailable and anomalies are scored independently of temporal ordering.
The ability to transfer knowledge gained in previous tasks into new contexts is one of the most important mechanisms of human learning.
As robots aspire to long-term autonomous operation in complex dynamic environments, the ability to reliably make mission-critical decisions in ambiguous situations becomes essential.
We show that our method is able to successfully predict events in a wide variety of scenes and can produce multiple different predictions when the future is ambiguous.
Silhouettes provide rich information on three-dimensional shape, since the intersection of the associated visual cones generates the "visual hull", which encloses and approximates the original shape.
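A minimal NumPy sketch of the standard visual-hull construction by voxel carving: a voxel survives only if it projects inside every silhouette, so the surviving set approximates the intersection of the visual cones. The interface below is illustrative, not tied to any particular paper's code.

```python
import numpy as np

def visual_hull(silhouettes, cameras, grid, threshold=0.5):
    """Carve a voxel set down to the visual hull.

    silhouettes: list of (H, W) binary masks; cameras: list of 3x4 projection
    matrices; grid: (N, 3) array of voxel centers in world coordinates.
    """
    occupied = np.ones(len(grid), dtype=bool)
    homog = np.hstack([grid, np.ones((len(grid), 1))])       # (N, 4)
    for mask, P in zip(silhouettes, cameras):
        proj = homog @ P.T                                   # (N, 3)
        u = (proj[:, 0] / proj[:, 2]).round().astype(int)
        v = (proj[:, 1] / proj[:, 2]).round().astype(int)
        H, W = mask.shape
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        hit = np.zeros(len(grid), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > threshold
        occupied &= hit                # keep only voxels inside every cone
    return grid[occupied]
```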
With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN).
Our contributions on network design and rotation invariance offer insights that extend beyond motion estimation.
We present a simple approach for producing a small number of structured visual outputs which have high recall, for a variety of tasks including monocular pose estimation and semantic scene segmentation.
In this paper, we explore an approach to generating detectors that is radically different from the conventional way of learning a detector from a large corpus of annotated positive and negative data samples.
Because our CNN model makes no assumptions about the underlying scene, it can predict future optical flow on a diverse set of scenarios.
Cameras provide a rich source of information while being passive, cheap and lightweight for small and medium Unmanned Aerial Vehicles (UAVs).
We present an efficient algorithm with provable performance for building a high-quality list of detections from any candidate set of region-based proposals.
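As one hedged illustration of the flavor of such algorithms (greedy maximization of a score-minus-redundancy objective, a common route to provable approximation guarantees), and not the paper's exact method:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_detection_list(boxes, scores, k=10, redundancy=0.5):
    """Greedily build a list of k detections, discounting each candidate's
    score by its maximum overlap with already-selected boxes."""
    selected = []
    remaining = list(range(len(boxes)))
    while remaining and len(selected) < k:
        def gain(i):
            overlap = max((iou(boxes[i], boxes[j]) for j in selected), default=0.0)
            return scores[i] * (1.0 - redundancy * overlap)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```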
We theoretically guarantee that our algorithms achieve near-optimal linear predictions at each budget when a feature group is chosen.
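A hedged sketch of budgeted group selection for linear prediction, with the budget counted in groups and greedy forward selection standing in for the paper's actual algorithm:

```python
import numpy as np

def greedy_group_selection(X, y, groups, budget):
    """Greedily add whole feature groups, refitting least squares each time,
    until the budget (number of groups) is exhausted.

    groups: list of column-index arrays, one per feature group."""
    chosen = []
    for _ in range(min(budget, len(groups))):
        best, best_err = None, np.inf
        for g, idx in enumerate(groups):
            if g in chosen:
                continue
            cols = np.concatenate([groups[c] for c in chosen] + [idx])
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = np.sum((X[:, cols] @ w - y) ** 2)
            if err < best_err:
                best, best_err = g, err
        chosen.append(best)
    return chosen
```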
In this paper we present a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling.
We show that a surprisingly straightforward and general approach, which we call ALERT, can predict the likely accuracy (or failure) of a variety of computer vision systems, including semantic segmentation, vanishing-point and camera-parameter estimation, and image memorability prediction, on individual input images.
Structured prediction plays a central role in machine learning applications from computational biology to computer vision.
When applied to MAP inference, the algorithm is a parallel extension of Iterated Conditional Modes (ICM) with climbing and convergence properties that make it a compelling alternative to the sequential ICM.
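A minimal NumPy sketch of one standard way to parallelize ICM while keeping an energy-descent (climbing) property, using a checkerboard schedule on a 4-connected Potts model; the paper's actual scheme may differ.

```python
import numpy as np

def parallel_icm(unary, smoothness=1.0, iters=20):
    """Checkerboard-parallel ICM for a 4-connected Potts MRF.

    unary: (H, W, L) label costs. Within one checkerboard color no two
    pixels are neighbors, so updating a whole color at once decreases the
    energy monotonically, like sequential ICM but vectorized.
    (Borders are handled by clipping, a minor approximation.)
    """
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)
    rows, cols = np.mgrid[0:H, 0:W]
    for _ in range(iters):
        for color in (0, 1):
            cost = unary.copy()
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny = np.clip(rows + dy, 0, H - 1)
                nx = np.clip(cols + dx, 0, W - 1)
                neighbor = labels[ny, nx]                      # (H, W)
                # Potts penalty for each candidate label disagreeing with a neighbor.
                cost += smoothness * (np.arange(L)[None, None, :] != neighbor[..., None])
            update = (rows + cols) % 2 == color
            labels[update] = cost.argmin(axis=2)[update]
    return labels

# Usage: denoise a noisy binary image treated as unary costs.
noisy = (np.random.rand(32, 32) > 0.5).astype(float)
unary = np.stack([noisy, 1.0 - noisy], axis=2)   # cost of labels 0 and 1
labels = parallel_icm(unary)
```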