We propose Differentiable Stereopsis, a multi-view stereo approach that reconstructs shape and texture from few input views and noisy cameras.
We address these challenges by introducing PyTorch3D, a library of modular, efficient, and differentiable operators for 3D deep learning.
When a toddler is presented with a new toy, their instinctual behaviour is to pick it up and inspect it with their hands and eyes in tandem, clearly searching over its surface to properly understand what they are playing with.
We introduce a new memory architecture, Bayesian Relational Memory (BRM), to improve the generalization ability for semantic visual navigation agents in unseen environments, where an agent is given a semantic target to navigate towards.
We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object.
To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.
To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task -- Embodied Question Answering -- in photo-realistic environments (Matterport 3D).
We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning.
Building deep reinforcement learning agents that can generalize and adapt to unseen environments remains a fundamental challenge for AI.
To generalize to unseen environments, an agent needs to be robust to low-level variations (e.g. color, texture, object changes) as well as high-level variations (e.g. layout changes of the environment).
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data.
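The omni-supervised regime described above is realized via data distillation: a trained model is run on unlabeled data under multiple input transforms, its predictions are ensembled, and confident ensemble predictions become training labels. The sketch below is illustrative only; the function name, softmax averaging, and confidence threshold are assumptions, not details from the paper.

```python
import numpy as np

def distill_labels(logits_per_transform, threshold=0.9):
    """Data-distillation-style pseudo-labeling sketch (illustrative).

    logits_per_transform: list of (num_samples, num_classes) logit arrays,
    one per input transform (e.g. flips, scales) of the same unlabeled data.
    Returns confident pseudo-labels and a mask of which samples were kept.
    """
    # Convert each transform's logits to probabilities (stable softmax).
    probs = [np.exp(l - l.max(-1, keepdims=True)) for l in logits_per_transform]
    probs = [p / p.sum(-1, keepdims=True) for p in probs]
    # Ensemble by averaging probabilities across transforms.
    mean = np.mean(probs, axis=0)
    conf = mean.max(-1)
    labels = mean.argmax(-1)
    # Keep only confident predictions as training labels.
    keep = conf >= threshold
    return labels[keep], keep
```

In practice the kept pseudo-labels are mixed with the original labeled set and the model is retrained on the union.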
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?").
Our hypothesis is that the appearance of a person -- their pose, clothing, action -- is a powerful cue for localizing the objects they are interacting with.
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system.
We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images.
A k-poselet is a deformable part model (DPM) with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations.
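Scoring a k-poselet hypothesis follows the usual DPM pattern: each part (a poselet) contributes its best filter response near its anchor position, minus a deformation penalty for displacing from that anchor. The sketch below is a minimal illustration of that scoring rule, assuming precomputed per-part response maps and quadratic deformation costs; all names and shapes are hypothetical, not from the paper.

```python
import numpy as np

def kposelet_score(part_responses, anchors, defo_weights, search=2):
    """Illustrative k-poselet score: sum over parts of the best
    (filter response - quadratic deformation cost) within a small
    search window around each part's anchor location.

    part_responses: list of 2D response maps, one per poselet part.
    anchors: list of (x, y) anchor positions, one per part.
    defo_weights: list of (wx, wy) quadratic deformation weights.
    """
    total = 0.0
    for resp, (ax, ay), (wx, wy) in zip(part_responses, anchors, defo_weights):
        best = -np.inf
        h, w = resp.shape
        # Search displacements (dx, dy) around the anchor.
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = ay + dy, ax + dx
                if 0 <= y < h and 0 <= x < w:
                    cost = wx * dx * dx + wy * dy * dy
                    best = max(best, resp[y, x] - cost)
        total += best
    return total
```

A detection is then the hypothesis whose summed part scores exceed a learned threshold, with keypoint predictions read off from the chosen part placements.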
We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image.