Search Results for author: Aniruddha Kembhavi

Found 78 papers, 41 papers with code

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

no code implementations11 Dec 2024 Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions.

Text to 3D

One Diffusion to Generate Them All

1 code implementation25 Nov 2024 Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu

Experimental results demonstrate competitive performance across tasks in both generation and prediction such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation despite relatively small training dataset.

Camera Pose Estimation Deblurring +4

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

no code implementations28 Jun 2024 Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation.

Decoder Object +1

CodeNav: Beyond tool-use to using real-world codebases with LLM agents

no code implementations18 Jun 2024 Tanmay Gupta, Luca Weihs, Aniruddha Kembhavi

We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries.

Task Me Anything

1 code implementation17 Jun 2024 Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna

As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case.

2k Attribute +3

Preserving Identity with Variational Score for General-purpose 3D Editing

no code implementations13 Jun 2024 Duong H. Le, Tuan Pham, Aniruddha Kembhavi, Stephan Mandt, Wei-Chiu Ma, Jiasen Lu

We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models.

Denoising

Seeing the Unseen: Visual Common Sense for Semantic Placement

no code implementations CVPR 2024 Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps are straightforward for objects not present in the image.

Common Sense Reasoning Object

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

1 code implementation28 Dec 2023 Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action.

Decoder Image Generation +1

Harmonic Mobile Manipulation

no code implementations11 Dec 2023 Ruihan Yang, Yejin Kim, Rose Hendrix, Aniruddha Kembhavi, Xiaolong Wang, Kiana Ehsani

Recent advancements in robotics have enabled robots to navigate complex scenes or manipulate diverse objects independently.

Navigate

Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing

1 code implementation29 Nov 2023 Piper Wolters, Favyen Bastani, Aniruddha Kembhavi

Super-Resolution for remote sensing has the potential for huge impact on planet monitoring by producing accurate and realistic high resolution imagery on a frequent basis and a global scale.

Super-Resolution

MIMIC: Masked Image Modeling with Image Correspondences

1 code implementation27 Jun 2023 Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna

We train multiple models with different masked image modeling objectives to showcase the following findings: Representations trained on our automatically generated MIMIC-3M outperform those learned from expensive crowdsourced datasets (ImageNet-1K) and those learned from synthetic environments (MULTIVIEW-HABITAT) on two dense geometric tasks: depth estimation on NYUv2 (1. 7%), and surface normals estimation on Taskonomy (2. 05%).

Depth Estimation Pose Estimation +3

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

2 code implementations NeurIPS 2023 Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, Ranjay Krishna

In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models have permeated the machine learning ecosystem.

Neural Priming for Sample-Efficient Adaptation

1 code implementation NeurIPS 2023 Matthew Wallingford, Vivek Ramanujan, Alex Fang, Aditya Kusupati, Roozbeh Mottaghi, Aniruddha Kembhavi, Ludwig Schmidt, Ali Farhadi

Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks.

Transfer Learning

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

1 code implementation28 Mar 2023 Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal, Aniruddha Kembhavi

As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support.

Neural Radiance Field Codebooks

1 code implementation10 Jan 2023 Matthew Wallingford, Aditya Kusupati, Alex Fang, Vivek Ramanujan, Aniruddha Kembhavi, Roozbeh Mottaghi, Ali Farhadi

Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks.

Object Representation Learning +1

EXCALIBUR: Encouraging and Evaluating Embodied Exploration

no code implementations CVPR 2023 Hao Zhu, Raghav Kapoor, So Yeon Min, Winson Han, Jiatai Li, Kaiwen Geng, Graham Neubig, Yonatan Bisk, Aniruddha Kembhavi, Luca Weihs

Humans constantly explore and learn about their environment out of curiosity, gather information, and update their models of the world.

Scene Graph Contrastive Learning for Embodied Navigation

no code implementations ICCV 2023 Kunal Pratap Singh, Jordi Salvador, Luca Weihs, Aniruddha Kembhavi

Training effective embodied AI agents often involves expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization.

Contrastive Learning Representation Learning

A General Purpose Supervisory Signal for Embodied Agents

no code implementations1 Dec 2022 Kunal Pratap Singh, Jordi Salvador, Luca Weihs, Aniruddha Kembhavi

Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization.

Contrastive Learning Representation Learning

SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding

1 code implementation ICCV 2023 Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, Aniruddha Kembhavi

Remote sensing images are useful for a wide variety of planet monitoring applications, from tracking deforestation to tackling illegal fishing.

Time Series Time Series Analysis

Visual Programming: Compositional visual reasoning without training

1 code implementation CVPR 2023 Tanmay Gupta, Aniruddha Kembhavi

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions.

In-Context Learning Question Answering +2

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

1 code implementation ICCV 2023 Sophia Gu, Christopher Clark, Aniruddha Kembhavi

We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images.

Image Captioning Question Answering +2

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

no code implementations17 Jun 2022 Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing.

Depth Estimation Image Generation +12

What do navigation agents learn about their environment?

1 code implementation CVPR 2022 Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, Roozbeh Mottaghi

We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment.

Visual Navigation

GRIT: General Robust Image Task Benchmark

1 code implementation28 Apr 2022 Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem

Computer vision models excel at making predictions when the test distribution closely resembles the training distribution.

Instance Segmentation Keypoint Detection +7

Object Manipulation via Visual Target Localization

no code implementations15 Mar 2022 Kiana Ehsani, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi

Object manipulation is a critical skill required for Embodied AI agents interacting with the world around them.

Object object-detection +1

ASC me to Do Anything: Multi-task Training for Embodied AI

no code implementations14 Feb 2022 Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi

We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks.

Webly Supervised Concept Expansion for General Purpose Vision Models

no code implementations4 Feb 2022 Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi

This work presents an effective and inexpensive alternative: learn skills from supervised datasets, learn concepts from web image search, and leverage a key characteristic of GPVs: the ability to transfer visual knowledge across skills.

Human-Object Interaction Detection Image Retrieval +4

Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture

no code implementations CVPR 2022 Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process.

Question Answering Visual Question Answering

Container: Context Aggregation Networks

2 code implementations NeurIPS 2021 Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Inductive Bias Instance Segmentation +4

Simple but Effective: CLIP Embeddings for Embodied AI

2 code implementations CVPR 2022 Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi

Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation.

Image Manipulation Navigate

RobustNav: Towards Benchmarking Robustness in Embodied Navigation

1 code implementation ICCV 2021 Prithvijit Chattopadhyay, Judy Hoffman, Roozbeh Mottaghi, Aniruddha Kembhavi

As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual - affecting RGB inputs - and dynamics - affecting transition dynamics - corruptions.

Benchmarking Data Augmentation +1

Container: Context Aggregation Network

4 code implementations2 Jun 2021 Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations.

Image Classification Inductive Bias +5

ManipulaTHOR: A Framework for Visual Object Manipulation

1 code implementation CVPR 2021 Kiana Ehsani, Winson Han, Alvaro Herrasti, Eli VanderBilt, Luca Weihs, Eric Kolve, Aniruddha Kembhavi, Roozbeh Mottaghi

Object manipulation is an established research domain within the robotics community and poses several challenges including manipulator motion, grasping and long-horizon planning, particularly when dealing with oft-overlooked practical setups involving visually rich and complex scenes, manipulation using mobile agents (as opposed to tabletop manipulation), and generalization to unseen environments and objects.

Object

GridToPix: Training Embodied Agents with Minimal Supervision

no code implementations ICCV 2021 Unnat Jain, Iou-Jen Liu, Svetlana Lazebnik, Aniruddha Kembhavi, Luca Weihs, Alexander Schwing

While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards.

Deep Reinforcement Learning PointGoal Navigation +2

Visual Semantic Role Labeling for Video Understanding

1 code implementation CVPR 2021 Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling.

Semantic Role Labeling Video Recognition +1

Towards General Purpose Vision Systems

2 code implementations1 Apr 2021 Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process.

Question Answering Visual Question Answering

Visual Room Rearrangement

2 code implementations CVPR 2021 Luca Weihs, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi

We particularly focus on the task of Room Rearrangement: an agent begins by exploring a room and recording objects' initial configurations.

Navigate

Learning Flexible Visual Representations via Interactive Gameplay

no code implementations ICLR 2021 Luca Weihs, Aniruddha Kembhavi, Kiana Ehsani, Sarah M Pratt, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, Ali Farhadi

A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making and socialization.

Decision Making Representation Learning

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

1 code implementation EMNLP 2020 Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

X-LXMERT's image generation capabilities rival state of the art generative models while its question answering and captioning abilities remains comparable to LXMERT.

Image Captioning Image Generation +3

AllenAct: A Framework for Embodied AI Research

1 code implementation28 Aug 2020 Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, Aniruddha Kembhavi

The domain of Embodied AI, in which agents learn to complete tasks through interaction with their environment from egocentric observations, has experienced substantial growth with the advent of deep reinforcement learning and increased interest from the computer vision, NLP, and robotics communities.

Deep Reinforcement Learning Embodied Question Answering +2

Bridging the Imitation Gap by Adaptive Insubordination

no code implementations NeurIPS 2021 Luca Weihs, Unnat Jain, Iou-Jen Liu, Jordi Salvador, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing

However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results.

Imitation Learning Memorization +3

FLUID: A Unified Evaluation Framework for Flexible Sequential Data

2 code implementations6 Jul 2020 Matthew Wallingford, Aditya Kusupati, Keivan Alizadeh-Vahid, Aaron Walsman, Aniruddha Kembhavi, Ali Farhadi

To foster research towards the goal of general ML methods, we introduce a new unified evaluation framework - FLUID (Flexible Sequential Data).

Continual Learning Representation Learning +1

Supermasks in Superposition

2 code implementations NeurIPS 2020 Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting.

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

3 code implementations23 Jun 2020 Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, Erik Wijmans

In particular, the agent is initialized at a random location and pose in an environment and asked to find an instance of an object category, e. g., find a chair, by navigating to it.

Object

Feel The Music: Automatically Generating A Dance For An Input Song

1 code implementation21 Jun 2020 Purva Tendulkar, Abhishek Das, Aniruddha Kembhavi, Devi Parikh

We encode intuitive, flexible heuristics for what a 'good' dance is: the structure of the dance should align with the structure of the music.

Learning About Objects by Learning to Interact with Them

no code implementations NeurIPS 2020 Martin Lohmann, Jordi Salvador, Aniruddha Kembhavi, Roozbeh Mottaghi

Much of the remarkable progress in computer vision has been focused around fully supervised learning mechanisms relying on highly curated datasets for a variety of tasks.

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

1 code implementation CVPR 2020 Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, Ali Farhadi

We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems.

Grounded Situation Recognition

1 code implementation ECCV 2020 Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi

We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e. g. agent, tool), and bounding-box groundings of entities.

Grounded Situation Recognition Image Retrieval +1

Learning Generalizable Visual Representations via Interactive Gameplay

no code implementations17 Dec 2019 Luca Weihs, Aniruddha Kembhavi, Kiana Ehsani, Sarah M Pratt, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, Ali Farhadi

A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization.

Decision Making Reinforcement Learning +1

ELASTIC: Improving CNNs with Dynamic Scaling Policies

1 code implementation CVPR 2019 Huiyu Wang, Aniruddha Kembhavi, Ali Farhadi, Alan Yuille, Mohammad Rastegari

We formulate the scaling policy as a non-linear function inside the network's structure that (a) is learned from data, (b) is instance specific, (c) does not add extra computation, and (d) can be applied on any network architecture.

General Classification Multi-Label Classification +1

Imagine This! Scripts to Compositions to Videos

5 code implementations ECCV 2018 Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, Aniruddha Kembhavi

Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge.

Retrieval World Knowledge

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

1 code implementation CVPR 2018 Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi

Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively).

Question Answering Visual Question Answering

Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension

no code implementations CVPR 2017 Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, Hannaneh Hajishirzi

Our analysis shows that a significant portion of questions require complex parsing of the text and the diagrams and reasoning, indicating that our dataset is more complex compared to previous machine comprehension and visual question answering datasets.

Question Answering Reading Comprehension +1

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

no code implementations26 Apr 2017 Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, Devi Parikh

Finally, we evaluate several existing VQA models under this new setting and show that the performances of these models degrade by a significant amount compared to the original VQA setting.

Question Answering Visual Question Answering

Bidirectional Attention Flow for Machine Comprehension

27 code implementations5 Nov 2016 Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi

Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.

Cloze Test Navigate +2

A Diagram Is Worth A Dozen Images

1 code implementation24 Mar 2016 Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi

We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering.

Visual Question Answering (VQA)

Cannot find the paper you are looking for? You can Submit a new open access paper.