We present an approach for full 3D scene reconstruction from a single unseen image.
Recent advancements in differentiable rendering and 3D reasoning have driven exciting results in novel view synthesis from a single image.
Aligning partial views of a scene into a single whole is essential to understanding one's environment and is a key component of numerous robotics tasks such as SLAM and SfM.
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining.
We address these challenges by introducing PyTorch3D, a library of modular, efficient, and differentiable operators for 3D deep learning.
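For a flavor of the kind of 3D operator such a library makes differentiable, here is a plain-NumPy sketch of the symmetric Chamfer distance between two point clouds. This is an illustrative reimplementation under our own naming and shape conventions, not PyTorch3D's API:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M).
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbor distance in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy point cloud: corners of a unit tetrahedron.
cloud = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
```

Because every step is a smooth array operation (apart from the min, which is subdifferentiable), the same computation expressed in an autodiff framework yields gradients with respect to the point coordinates, which is what makes it usable as a training loss.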
The de facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet.
In this paper, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models.
Ranked #1 on Audio Question Answering on DAQA
The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
Ranked #3 on Visual Reasoning on PHYRE-1B-Within
We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object.
Ranked #1 on 3D Shape Modeling on Pix3D S2
Compared with current methodologies that compare point and curve estimates of model families, distribution estimates paint a more complete picture of the entire design landscape.
We show that these encodings are competitive with existing data hiding algorithms, and further that they can be made robust to noise: our models learn to reconstruct hidden information in an encoded image despite the presence of Gaussian blurring, pixel-wise dropout, cropping, and JPEG compression.
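As a rough illustration of one such corruption, a pixel-wise dropout noise layer can be sketched in NumPy as replacing a random subset of encoded pixels with the corresponding cover-image pixels. The function name and exact formulation here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pixel_dropout(encoded, cover, keep_prob, rng):
    """Dropout-style noise layer (a sketch): keep each encoded pixel with
    probability keep_prob, otherwise fall back to the cover-image pixel."""
    mask = rng.random(encoded.shape[:2]) < keep_prob  # per-pixel keep mask
    return np.where(mask[..., None], encoded, cover)

rng = np.random.default_rng(0)
cover = rng.random((8, 8, 3))          # stand-in for the original image
encoded = np.clip(cover + 0.01, 0, 1)  # stand-in for an encoded image
noised = pixel_dropout(encoded, cover, keep_prob=0.7, rng=rng)
```

Training the decoder on such corrupted outputs is what forces the hidden message to survive the corruption at test time.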
To overcome this limitation we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships.
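A scene graph can be represented minimally as object nodes plus (subject, predicate, object) relationship triples. The encoding below is a hypothetical sketch of that structure, not the paper's exact format:

```python
# Objects are nodes; relationships are (subject, predicate, object)
# triples whose subject/object entries index into the object list.
scene_graph = {
    "objects": ["sheep", "sheep", "grass"],
    "relationships": [
        (0, "standing on", 2),
        (1, "by", 0),
    ],
}

def relation_phrases(graph):
    """Render each relationship triple as a readable phrase."""
    objs = graph["objects"]
    return [f"{objs[s]} {p} {objs[o]}" for s, p, o in graph["relationships"]]
```

Indexing objects by position (rather than by name) is what lets the same category, like "sheep" here, appear as multiple distinct nodes.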
Ranked #4 on Layout-to-Image Generation on Visual Genome 64x64
We present a novel Dynamic Differentiable Reasoning (DDR) framework for jointly learning branching programs and the functions composing them; this resolves a significant nondifferentiability inhibiting recent dynamic architectures.
Ranked #8 on Visual Question Answering on CLEVR
Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments.
Ranked #4 on Trajectory Prediction on ETH
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
Ranked #5 on Visual Question Answering on CLEVR-Humans
Recent progress in style transfer on images has focused on improving the quality of stylized images and speed of methods.
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.
We consider image transformation problems, where an input image is transformed into an output image.
Ranked #4 on Nuclear Segmentation on Cell17
no code implementations • 23 Feb 2016 • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering.
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language.
Ranked #2 on Object Detection on Visual Genome
Some images that are difficult to recognize on their own may become more clear in the context of a neighborhood of related images with similar social-network metadata.
Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data.
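The LSTM update itself is compact; a minimal NumPy sketch of a single step (variable names and gate ordering are our own convention, not tied to any particular library) looks like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell.

    x: (D,) input; h, c: (H,) hidden and cell state;
    W: (4H, D + H) stacked gate weights; b: (4H,) bias.
    Stacked gate order: input, forget, output, candidate.
    """
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:])       # candidate cell update
    c_new = f * c + i * g        # gated cell-state update
    h_new = o * np.tanh(c_new)   # gated output
    return h_new, c_new

# Unroll over a short random sequence.
rng = np.random.default_rng(0)
D, H = 3, 4
W, b = rng.standard_normal((4 * H, D + H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):
    h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
```

The additive cell-state update `f * c + i * g` is the mechanism usually credited with letting gradients flow across long sequences, which is why the LSTM variant dominates in the sequential applications mentioned above.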
We introduce a novel dataset of 5,000 human-generated scene graphs grounded to images and use this dataset to evaluate our method for image retrieval.