Probing Visual Language Priors in VLMs

Tiange Luo, Ang Cao, GunHee Lee, Justin Johnson, Honglak Lee

Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual language priors present in their training data rather than true visual reasoning.

Question Answering Visual Question Answering

Lightplane: Highly-Scalable Components for Neural 3D Fields

Ang Cao, Justin Johnson, Andrea Vedaldi, David Novotny

Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision.

3D Reconstruction

Probing the 3D Awareness of Visual Foundation Models

Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani

Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure.

View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications.

3D Object Captioning Hallucination

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas

This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics.

Motion Synthesis

Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data

Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey

We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data.

3D Reconstruction

Hyperbolic Image-Text Representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs.

Image Classification Image-text Retrieval

Learning Visual Representations via Language-Guided Sampling

Mohamed El Banani, Karan Desai, Justin Johnson

Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters.

Contrastive Learning Representation Learning

Text-To-4D Dynamic Scene Generation

Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.

Scene Generation

HexPlane: A Fast Representation for Dynamic Scenes

Ang Cao, Justin Johnson

HexPlane is a simple and effective solution for representing 4D volumes, and we hope it can broadly contribute to modeling spacetime for dynamic 3D scenes.

Novel View Synthesis

Multiview Compressive Coding for 3D Reconstruction

Chao-yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos.

3D Reconstruction Decoder

Neural Shape Compiler: A Unified Framework for Transforming between Text, Point Cloud, and Program

Tiange Luo, Honglak Lee, Justin Johnson

On Text2Shape, ShapeGlot, ABO, Genre, and Program Synthetic datasets, Neural Shape Compiler shows strengths in $\textit{Text} \Longrightarrow \textit{Point Cloud}$, $\textit{Point Cloud} \Longrightarrow \textit{Text}$, $\textit{Point Cloud} \Longrightarrow \textit{Program}$, and Point Cloud Completion tasks.

Point Cloud Completion

Self-Supervised Correspondence Estimation via Multiview Registration

Mohamed El Banani, Ignacio Rocco, David Novotny, Andrea Vedaldi, Natalia Neverova, Justin Johnson, Benjamin Graham

To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences.

RGB no more: Minimally-decoded JPEG Vision Transformers

Jeongsoo Park, Justin Johnson

However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks.

Data Augmentation

The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs

Chris Rockwell, Justin Johnson, David F. Fouhey

We present a simple baseline for directly estimating the relative pose (rotation and translation, including scale) between two images.

Inductive Bias Pose Prediction

FWD: Real-time Novel View Synthesis with Forward Warping and Depth

Ang Cao, Chris Rockwell, Justin Johnson

Novel view synthesis (NVS) is a challenging task requiring systems to generate photorealistic images of scenes from new viewpoints, where both quality and speed are important for applications.

Novel View Synthesis

Learning 3D Object Shape and Layout without 3D Supervision

Georgia Gkioxari, Nikhila Ravi, Justin Johnson

A 3D scene consists of a set of objects, each with a shape and a layout giving their position in space.

RedCaps: web-curated image-text data created by the people, for the people

Karan Desai, Gaurav Kaul, Zubin Aysola, Justin Johnson

We introduce RedCaps, a large-scale dataset of 12M image-text pairs collected from Reddit.

PixelSynth: Generating a 3D-Consistent Experience from a Single Image

Chris Rockwell, David F. Fouhey, Justin Johnson

Recent advancements in differentiable rendering and 3D reasoning have driven exciting results in novel view synthesis from a single image.

Novel View Synthesis

Inverting and Understanding Object Detectors

Ang Cao, Justin Johnson

As a core problem in computer vision, the performance of object detection has improved drastically in the past few years.

Object Detection

Bootstrap Your Own Correspondences

Mohamed El Banani, Justin Johnson

Our approach combines classic ideas from point cloud registration with more recent representation learning approaches.

Point Cloud Registration Representation Learning

Rethinking "Batch" in BatchNorm

Yuxin Wu, Justin Johnson

BatchNorm is a critical building block in modern convolutional neural networks.

UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering

Mohamed El Banani, Luya Gao, Justin Johnson

Aligning partial views of a scene into a single whole is essential to understanding one's environment and is a key component of numerous robotics tasks such as SLAM and SfM.

Point Cloud Registration

Accelerating 3D Deep Learning with PyTorch3D

Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, Georgia Gkioxari

We address these challenges by introducing PyTorch3D, a library of modular, efficient, and differentiable operators for 3D deep learning.

Autonomous Vehicles Deep Learning

VirTex: Learning Visual Representations from Textual Annotations

Karan Desai, Justin Johnson

The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet.

General Classification Image Captioning

SynSin: End-to-end View Synthesis from a Single Image

Olivia Wiles, Georgia Gkioxari, Richard Szeliski, Justin Johnson

Single image view synthesis allows for the generation of new views of a scene given a single input image.

Novel View Synthesis

Temporal Reasoning via Audio Question Answering

Haytham M. Fayek, Justin Johnson

In this paper, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models.

Audio Question Answering Question Answering

PHYRE: A New Benchmark for Physical Reasoning

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick

The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.

Visual Reasoning

Mesh R-CNN

Georgia Gkioxari, Jitendra Malik, Justin Johnson

We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object.

3D Shape Modeling

On Network Design Spaces for Visual Recognition

Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, Piotr Dollár

Compared to current methodologies of comparing point and curve estimates of model families, distribution estimates paint a more complete picture of the entire design landscape.

Neural Architecture Search

HiDDeN: Hiding Data With Deep Networks

Jiren Zhu, Russell Kaplan, Justin Johnson, Li Fei-Fei

We show that these encodings are competitive with existing data hiding algorithms, and further that they can be made robust to noise: our models learn to reconstruct hidden information in an encoded image despite the presence of Gaussian blurring, pixel-wise dropout, cropping, and JPEG compression.

Image Generation from Scene Graphs

Justin Johnson, Agrim Gupta, Li Fei-Fei

To overcome this limitation, we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships.

Image Generation from Scene Graphs Layout-to-Image Generation

DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer

Joseph Suarez, Justin Johnson, Fei-Fei Li

We present a novel Dynamic Differentiable Reasoning (DDR) framework for jointly learning branching programs and the functions composing them; this resolves a significant nondifferentiability inhibiting recent dynamic architectures.

Question Answering Visual Question Answering

Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks

Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, Alexandre Alahi

Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments.

Collision Avoidance Motion Forecasting

Inferring and Executing Programs for Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.

Visual Question Answering (VQA) Visual Reasoning

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.

Question Answering Visual Question Answering

A Hierarchical Approach for Generating Descriptive Image Paragraphs

Jonathan Krause, Justin Johnson, Ranjay Krishna, Li Fei-Fei

Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.

Dense Captioning Descriptive

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson, Andrej Karpathy, Li Fei-Fei

We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language.

Dense Captioning Image Captioning

Love Thy Neighbors: Image Annotation by Exploiting Image Metadata

Justin Johnson, Lamberto Ballan, Fei-Fei Li

Some images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata.

Visualizing and Understanding Recurrent Networks

Andrej Karpathy, Justin Johnson, Li Fei-Fei

Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data.

Image Retrieval Using Scene Graphs

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, Li Fei-Fei

We introduce a novel dataset of 5,000 human-generated scene graphs grounded to images and use this dataset to evaluate our method for image retrieval.

Image Retrieval Object Localization

