Search Results for author: Dhruv Batra

Found 175 papers, 83 papers with code

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

1 code implementation9 May 2024 Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner

This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains.

Representation Learning Scene Understanding

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

no code implementations CVPR 2024 Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi

The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images.

Navigate Visual Navigation

Seeing the Unseen: Visual Common Sense for Semantic Placement

no code implementations CVPR 2024 Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps are straightforward for objects not present in the image.

Common Sense Reasoning Object

Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

no code implementations14 Dec 2023 Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Shibo Zhao, Yu Quan Chong, Chen Wang, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, Yonatan Bisk

Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i. e., foundation models) in research fields such as Natural Language Processing (NLP) and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of robotics, and also exploring (ii) what a robotics-specific foundation model would look like.

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

no code implementations6 Dec 2023 Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, Bernadette Bucher

Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors.

Language Modelling Navigate

Skill Transformer: A Monolithic Policy for Mobile Manipulation

no code implementations ICCV 2023 Xiaoyu Huang, Dhruv Batra, Akshara Rai, Andrew Szot

We present Skill Transformer, an approach for solving long-horizon robotic tasks by combining conditional sequence modeling and skill modularity.

Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations

no code implementations7 Aug 2023 Nirbhay Modhe, Qiaozi Gao, Ashwin Kalyan, Dhruv Batra, Govind Thattai, Gaurav Sukhatme

Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions.

Offline RL reinforcement-learning +1

Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

no code implementations CVPR 2024 Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, Manolis Savva

Surprisingly, we observe that agents trained on just 122 scenes from our dataset outperform agents trained on 10, 000 scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in real-world scanned environments.

Navigate ObjectGoal Navigation +1

Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second

1 code implementation CVPR 2023 Vincent-Pierre Berges, Andrew Szot, Devendra Singh Chaplot, Aaron Gokaslan, Roozbeh Mottaghi, Dhruv Batra, Eric Undersander

Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects - by navigating to an object, picking it up, navigating to a target location, and then placing the object at the target location.

Reinforcement Learning (RL)

Adaptive Coordination in Social Embodied Rearrangement

no code implementations31 May 2023 Andrew Szot, Unnat Jain, Dhruv Batra, Zsolt Kira, Ruta Desai, Akshara Rai

We present the task of "Social Rearrangement", consisting of cooperative everyday tasks like setting up the dinner table, tidying a house or unpacking groceries in a simulated multi-agent environment.

Diversity

AutoNeRF: Training Implicit Scene Representations with Autonomous Agents

1 code implementation21 Apr 2023 Pierre Marza, Laetitia Matignon, Olivier Simonin, Dhruv Batra, Christian Wolf, Devendra Singh Chaplot

Empirical results show that NeRFs can be trained on actively collected data using just a single episode of experience in an unseen environment, and can be used for several downstream robotic tasks, and that modular trained exploration models outperform other classical and end-to-end baselines.

Novel View Synthesis

OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

no code implementations14 Mar 2023 Karmesh Yadav, Arjun Majumdar, Ram Ramrakhya, Naoki Yokoyama, Alexei Baevski, Zsolt Kira, Oleksandr Maksymets, Dhruv Batra

We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find a chair") tasks without any task-specific modules like object detection, segmentation, mapping, or planning modules.

object-detection Object Detection +3

Emergence of Maps in the Memories of Blind Navigation Agents

no code implementations30 Jan 2023 Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra

A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial.

Inductive Bias PointGoal Navigation

PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav

1 code implementation CVPR 2023 Ram Ramrakhya, Dhruv Batra, Erik Wijmans, Abhishek Das

We find that BC$\rightarrow$RL on human demonstrations outperforms BC$\rightarrow$RL on SP and FE trajectories, even when controlled for same BC-pretraining success on train, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies.

Imitation Learning Navigate +2

Cross-Domain Transfer via Semantic Skill Imitation

no code implementations14 Dec 2022 Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, Akshara Rai

We propose an approach for semantic imitation, which uses demonstrations from a source domain, e. g. human videos, to accelerate reinforcement learning (RL) in a different target domain, e. g. a robotic manipulator in a simulated kitchen.

Reinforcement Learning (RL) Robot Manipulation

Navigating to Objects in the Real World

no code implementations2 Dec 2022 Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, Devendra Singh Chaplot

In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality.

Navigate Visual Navigation

Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances

no code implementations29 Nov 2022 Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, Devendra Singh Chaplot

We consider the problem of embodied visual navigation given an image-goal (ImageNav) where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image.

Visual Navigation

ViNL: Visual Navigation and Locomotion Over Obstacles

1 code implementation26 Oct 2022 Simar Kareer, Naoki Yokoyama, Dhruv Batra, Sehoon Ha, Joanne Truong

ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guides the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands.

Navigate Visual Navigation

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

1 code implementation24 Jun 2022 Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e. g., "find a sink").

Is Mapping Necessary for Realistic PointGoal Navigation?

1 code implementation CVPR 2022 Ruslan Partsey, Erik Wijmans, Naoki Yokoyama, Oles Dobosevych, Dhruv Batra, Oleksandr Maksymets

However, for PointNav in a realistic setting (RGB-D and actuation noise, no GPS+Compass), this is an open question; one we tackle in this paper.

Data Augmentation Navigate +3

Housekeep: Tidying Virtual Households using Commonsense Reasoning

1 code implementation22 May 2022 Yash Kant, Arun Ramachandran, Sriram Yenamandra, Igor Gilitschenski, Dhruv Batra, Andrew Szot, Harsh Agrawal

Instead, the agent must learn from and is evaluated against human preferences of which objects belong where in a tidy house.

Language Modelling Large Language Model

Episodic Memory Question Answering

no code implementations CVPR 2022 Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, Devi Parikh

Towards that end, we introduce (1) a new task - Episodic Memory Question Answering (EMQA) wherein an egocentric AI assistant is provided with a video sequence (the tour) and a question as an input and is asked to localize its answer to the question within the tour, (2) a dataset of grounded questions designed to probe the agent's spatio-temporal understanding of the tour, and (3) a model for the task that encodes the scene as an allocentric, top-down semantic feature map and grounds the question into the map to localize the answer.

AI Agent Question Answering

Offline Visual Representation Learning for Embodied Navigation

2 code implementations27 Apr 2022 Karmesh Yadav, Ram Ramrakhya, Arjun Majumdar, Vincent-Pierre Berges, Sachit Kuhar, Dhruv Batra, Alexei Baevski, Oleksandr Maksymets

In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules.

Representation Learning Self-Supervised Learning

Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale

no code implementations CVPR 2022 Ram Ramrakhya, Eric Undersander, Dhruv Batra, Abhishek Das

We present a large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments -- (1) ObjectGoal Navigation (e. g. 'find & go to a chair') and (2) Pick&Place (e. g. 'find mug, pick mug, find counter, place mug on counter').

Imitation Learning ObjectGoal Navigation +1

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

no code implementations NeurIPS 2021 Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra

Natural language instructions for visual navigation often use scene descriptions (e. g., "bedroom") and object references (e. g., "green chairs") to provide a breadcrumb trail to a goal location.

Object Scene Classification +2

Ego4D: Around the World in 3,000 Hours of Egocentric Video

8 code implementations CVPR 2022 Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification Ethics

Waypoint Models for Instruction-guided Navigation in Continuous Environments

1 code implementation ICCV 2021 Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, Oleksandr Maksymets

Little inquiry has explicitly addressed the role of action spaces in language-guided visual navigation -- either in terms of its effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory.

Instruction Following Visual Navigation

Benchmarking Augmentation Methods for Learning Robust Navigation Agents: the Winning Entry of the 2021 iGibson Challenge

no code implementations22 Sep 2021 Naoki Yokoyama, Qian Luo, Dhruv Batra, Sehoon Ha

Recent advances in deep reinforcement learning and scalable photorealistic simulation have led to increasingly mature embodied AI for various visual tasks, including navigation.

Benchmarking Image Augmentation +4

Realistic PointGoal Navigation via Auxiliary Losses and Information Bottleneck

1 code implementation17 Sep 2021 Guillermo Grande, Dhruv Batra, Erik Wijmans

Under this setting, the agent incurs a penalty for using this privileged information, encouraging the agent to only leverage this information when it is crucial to learning.

PointGoal Navigation

Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice

1 code implementation26 Jun 2021 Nirbhay Modhe, Harish Kamath, Dhruv Batra, Ashwin Kalyan

This work shows that value-aware model learning, known for its numerous theoretical benefits, is also practically viable for solving challenging continuous control tasks in prevalent model-based reinforcement learning algorithms.

Continuous Control Model-based Reinforcement Learning

Auxiliary Tasks and Exploration Enable ObjectNav

1 code implementation8 Apr 2021 Joel Ye, Dhruv Batra, Abhishek Das, Erik Wijmans

We instead re-enable a generic learned agent by adding auxiliary learning tasks and an exploration reward.

Auxiliary Learning Navigate +2

Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation

no code implementations14 Mar 2021 Naoki Yokoyama, Sehoon Ha, Dhruv Batra

Several related works on navigation have used Success weighted by Path Length (SPL) as the primary method of evaluating the path an agent makes to a goal location, but SPL is limited in its ability to properly evaluate agents with complex dynamics.

Navigate

Large Batch Simulation for Deep Reinforcement Learning

1 code implementation ICLR 2021 Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian

We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19, 000 frames of experience per second on a single GPU and up to 72, 000 frames per second on a single eight-GPU machine.

PointGoal Navigation reinforcement-learning +1

THDA: Treasure Hunt Data Augmentation for Semantic Navigation

no code implementations ICCV 2021 Oleksandr Maksymets, Vincent Cartillier, Aaron Gokaslan, Erik Wijmans, Wojciech Galuba, Stefan Lee, Dhruv Batra

We show that this is a natural consequence of optimizing for the task metric (which in fact penalizes exploration), is enabled by powerful observation encoders, and is possible due to the finite set of training environment configurations.

Data Augmentation Navigate +3

How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget

no code implementations11 Dec 2020 Erik Wijmans, Irfan Essa, Dhruv Batra

PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge.

PointGoal Navigation

Bi-directional Domain Adaptation for Sim2Real Transfer of Embodied Navigation Agents

no code implementations24 Nov 2020 Joanne Truong, Sonia Chernova, Dhruv Batra

Simulation offers the ability to train large numbers of robots in parallel, and offers an abundance of data.

Domain Adaptation PointGoal Navigation Robotics

Where Are You? Localization from Embodied Dialog

2 code implementations EMNLP 2020 Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.

Navigate Visual Dialog

Sim-to-Real Transfer for Vision-and-Language Navigation

1 code implementation7 Nov 2020 Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee

We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.

Vision and Language Navigation

SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency

1 code implementation NAACL 2021 Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath R. Selvaraju

Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong.

Question Answering Visual Grounding +1

Contrast and Classify: Training Robust VQA Models

1 code implementation ICCV 2021 Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal

Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions.

Contrastive Learning Data Augmentation +4

Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

1 code implementation2 Oct 2020 Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra

We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?")

Decoder Representation Learning

Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents

no code implementations7 Sep 2020 Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra, Devi Parikh

This enables a seamless adaption to changing dynamics (a different robot or floor type) by simply re-calibrating the visual odometry model -- circumventing the expense of re-training of the navigation policy.

Navigate Robot Navigation +1

Auxiliary Tasks Speed Up Learning PointGoal Navigation

1 code implementation9 Jul 2020 Joel Ye, Dhruv Batra, Erik Wijmans, Abhishek Das

PointGoal Navigation is an embodied task that requires agents to navigate to a specified point in an unseen environment.

Navigate PointGoal Navigation

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

3 code implementations23 Jun 2020 Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, Erik Wijmans

In particular, the agent is initialized at a random location and pose in an environment and asked to find an instance of an object category, e. g., find a chair, by navigating to it.

Object

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments – Extended Abstract

no code implementations ICML Workshop LaReL 2020 Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.

Vision and Language Navigation

Extended Abstract: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

no code implementations ICML Workshop LaReL 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop near the sofa' requires an agent to ground scene elements referenced via language (e. g.'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

1 code implementation ECCV 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e. g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

4 code implementations ECCV 2020 Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.

Vision and Language Navigation

Analyzing Visual Representations in Embodied Navigation Tasks

no code implementations12 Mar 2020 Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos

Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task.

Reinforcement Learning (RL)

Insights on Visual Representations for Embodied Navigation Tasks

no code implementations ICLR 2020 Erik Wijmans, Julian Straub, Irfan Essa, Dhruv Batra, Judy Hoffman, Ari Morcos

Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures.

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

2 code implementations ECCV 2020 Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das

Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model!

Language Modelling Representation Learning +2

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

8 code implementations ICLR 2020 Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra

We leverage this scaling to train an agent for 2. 5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.

Autonomous Navigation Navigate +2

DS-VIC: Unsupervised Discovery of Decision States for Transfer in RL

no code implementations25 Sep 2019 Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, Ramakrishna Vedantam

We learn to identify decision states, namely the parsimonious set of states where decisions meaningfully affect the future states an agent can reach in an environment.

Improving Generative Visual Dialog by Answering Diverse Questions

1 code implementation IJCNLP 2019 Vishvak Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, Abhishek Das

Prior work on training generative Visual Dialog models with reinforcement learning(Das et al.) has explored a Qbot-Abot image-guessing game and shown that this 'self-talk' approach can lead to improved performance at the downstream dialog-conditioned image-guessing task.

Representation Learning Visual Dialog

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

no code implementations ICCV 2019 Jyoti Aneja, Harsh Agrawal, Dhruv Batra, Alexander Schwing

We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future.

Diversity Image Captioning +2

Cross-Task Knowledge Transfer for Visually-Grounded Navigation

no code implementations ICLR 2019 Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra

Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for two different tasks: learning to follow navigational instructions and embodied question answering.

Disentanglement Embodied Question Answering +3

Emergence of Compositional Language with Deep Generational Transmission

1 code implementation ICLR 2020 Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.

Reinforcement Learning (RL)

Counterfactual Visual Explanations

1 code implementation16 Apr 2019 Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, Stefan Lee

In this work, we develop a technique to produce counterfactual visual explanations.

counterfactual General Classification +1

Multi-Target Embodied Question Answering

1 code implementation CVPR 2019 Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.

Embodied Question Answering Navigate +1

Embodied Visual Recognition

no code implementations9 Apr 2019 Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David Crandall, Devi Parikh, Dhruv Batra

Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.

Object Object Localization +1

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

no code implementations CVPR 2019 Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D).

Embodied Question Answering Question Answering

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

no code implementations ICCV 2019 Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image.

Image Captioning Question Answering +2

EvalAI: Towards Better Evaluation Systems for AI Agents

3 code implementations10 Feb 2019 Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, Dhruv Batra

We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence algorithms (AI) at scale.

Benchmarking BIG-bench Machine Learning

Embodied Multimodal Multitask Learning

no code implementations4 Feb 2019 Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra

In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their grounding in visual objects across the tasks.

Disentanglement Embodied Question Answering +3

Response to "Visual Dialogue without Vision or Dialogue" (Massiceti et al., 2018)

no code implementations16 Jan 2019 Abhishek Das, Devi Parikh, Dhruv Batra

In a recent workshop paper, Massiceti et al. presented a baseline model and subsequent critique of Visual Dialog (Das et al., CVPR 2017) that raises what we believe to be unfounded concerns about the dataset and evaluation.

Visual Dialog

nocaps: novel object captioning at scale

2 code implementations ICCV 2019 Harsh Agrawal, Karan Desai, YuFei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task.

Image Captioning Object +2

Fabrik: An Online Collaborative Neural Network Editor

no code implementations27 Oct 2018 Utsav Garg, Viraj Prabhu, Deshraj Yadav, Ram Ramrakhya, Harsh Agrawal, Dhruv Batra

We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser.

TarMAC: Targeted Multi-Agent Communication

no code implementations ICLR 2019 Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, Joelle Pineau

We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments.

Multi-agent Reinforcement Learning

Neural Modular Control for Embodied Question Answering

2 code implementations26 Oct 2018 Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning.

Embodied Question Answering Imitation Learning +3

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations1 Oct 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i. e., new visual system.

Question Generation Question-Generation

Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance

1 code implementation ECCV 2018 Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee

Our approach, which we call Neuron Importance-AwareWeight Transfer (NIWT), learns to map domain knowledge about novel "unseen" classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts - essentially learning classifiers by discovering and composing learned semantic concepts in deep networks.

Generalized Zero-Shot Learning

Graph R-CNN for Scene Graph Generation

3 code implementations ECCV 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images.

Graph Generation Scene Graph Generation

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

9 code implementations26 Jul 2018 Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on VQA v2. 0 dataset -- from 65. 67% to 70. 22%.

Data Augmentation Visual Question Answering (VQA)

Talk the Walk: Navigating New York City through Grounded Dialogue

1 code implementation9 Jul 2018 Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela

We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception.

Navigate

Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations

no code implementations ICML 2018 Ashwin Kalyan, Stefan Lee, Anitha Kannan, Dhruv Batra

Many structured prediction problems (particularly in vision and language domains) are ambiguous, with multiple outputs being correct for an input - e. g. there are many ways of describing an image, multiple ways of translating a sentence; however, exhaustively annotating the applicability of all possible outputs is intractable due to exponentially large output spaces (e. g. all English sentences).

Diversity Multi-Label Classification +4

Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples

no code implementations ICLR 2018 Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, Sumit Gulwani

In this work, we propose Neural Guided Deductive Search (NGDS), a hybrid synthesis technique that combines the best of both symbolic logic techniques and statistical models.

Program Synthesis

Neural Baby Talk

1 code implementation CVPR 2018 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.

Image Captioning Object +3

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

1 code implementation CVPR 2018 Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi

Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively).

Question Answering Visual Question Answering

Embodied Question Answering

4 code implementations CVPR 2018 Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?").

Embodied Question Answering Navigate +3

Deal or No Deal? End-to-End Learning of Negotiation Dialogues

no code implementations EMNLP 2017 Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, Dhruv Batra

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions.

Natural Language Does Not Emerge `Naturally' in Multi-Agent Dialog

1 code implementation EMNLP 2017 Satwik Kottur, Jos{\'e} Moura, Stefan Lee, Dhruv Batra

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, learned without any human supervision!

Slot Filling

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog

3 code implementations26 Jun 2017 Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision!

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

3 code implementations16 Jun 2017 Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions.

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

1 code implementation NeurIPS 2017 Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts; in terms of automatic metrics, diversity, and informativeness of the responses.

Informativeness Metric Learning +2

Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning

no code implementations CVPR 2017 Qing Sun, Stefan Lee, Dhruv Batra

We develop the first approximate inference algorithm for 1-Best (and M-Best) decoding in bidirectional neural sequence models by extending Beam Search (BS) to reason about both forward and backward time dependencies.

Image Captioning Sentence

ParlAI: A Dialog Research Software Platform

22 code implementations EMNLP 2017 Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston

We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl. ai.

reinforcement-learning Reinforcement Learning (RL) +1

The Promise of Premise: Harnessing Question Premises in Visual Question Answering

1 code implementation EMNLP 2017 Aroma Mahendru, Viraj Prabhu, Akrit Mohapatra, Dhruv Batra, Stefan Lee

In this paper, we make a simple observation that questions about images often contain premises - objects and relationships implied by the question - and that reasoning about premises can help Visual Question Answering (VQA) models respond more intelligently to irrelevant or previously unseen questions.

Question Answering Visual Question Answering

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

no code implementations26 Apr 2017 Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, Devi Parikh

Finally, we evaluate several existing VQA models under this new setting and show that the performances of these models degrade by a significant amount compared to the original VQA setting.

Question Answering Visual Question Answering

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

7 code implementations ICCV 2017 Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.

reinforcement-learning Reinforcement Learning (RL) +2

LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

1 code implementation5 Mar 2017 Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account.

Image Generation

Visual Dialog

11 code implementations CVPR 2017 Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra

We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.

AI Agent Chatbot +2

Grad-CAM: Why did you say that?

2 code implementations22 Nov 2016 Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations.

Image Captioning Visual Question Answering

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

25 code implementations7 Oct 2016 Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra

We observe that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.

Diversity Image Captioning +5

Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles

no code implementations NeurIPS 2016 Stefan Lee, Senthil Purushwalkam, Michael Cogswell, Viresh Ranjan, David Crandall, Dhruv Batra

Many practical perception systems exist within larger processes that include interactions with users or additional components capable of evaluating the quality of predicted solutions.

Multiple-choice

Analyzing the Behavior of Visual Question Answering Models

1 code implementation EMNLP 2016 Aishwarya Agrawal, Dhruv Batra, Devi Parikh

Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA).

Question Answering Visual Question Answering

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

no code implementations17 Jun 2016 Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

We conduct large-scale studies on `human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images.

Question Answering Visual Question Answering

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

no code implementations EMNLP 2016 Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

We conduct large-scale studies on `human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images.

Question Answering Visual Question Answering