Search Results for author: Stefan Lee

Found 65 papers, 33 papers with code

You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

no code implementations26 Oct 2024 Eric Slyman, Anirudh Kanneganti, Sanghyun Hong, Stefan Lee

We study the impact of a standard practice in compressing foundation vision-language models - quantization - on the models' ability to produce socially-fair outputs.

Quantization
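As background for the entry above: quantization itself is a simple mechanical operation, and the fairness effects studied in the paper arise from the rounding error it introduces. A generic uniform affine post-training weight quantization (a textbook sketch, not the paper's specific compression setup) looks like:

```python
import numpy as np

def quantize_dequantize(w, num_bits=8):
    """Round weights to num_bits unsigned integers with an affine
    (scale, zero-point) mapping, then map them back to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized weights

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
w_hat = quantize_dequantize(w, num_bits=8)  # close to w, up to round-off
```

The per-weight error is bounded by half the quantization step; the paper's point is that even such small perturbations can shift model outputs in socially salient ways.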

Language-Informed Beam Search Decoding for Multilingual Machine Translation

1 code implementation11 Aug 2024 Yilin Yang, Stefan Lee, Prasad Tadepalli

Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input.

Language Identification, Machine Translation +2
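For context on the entry above, a minimal beam search decoder over an abstract next-token distribution can be sketched as follows. This is a textbook illustration, not the paper's language-informed variant; `step_fn`, `bos`, and `eos` are placeholder names for the model's next-token distribution and sentence delimiters.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=10):
    """Minimal beam search: step_fn(prefix) -> {next_token: probability}.
    Returns the highest-scoring completed sequence."""
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep the top-k extensions; finished hypotheses leave the beam.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == eos:
                completed.append((seq, score))
            elif len(beams) < beam_size:
                beams.append((seq, score))
        if not beams:
            break
    pool = completed or beams
    return max(pool, key=lambda c: c[1])[0]
```

In multilingual NMT the target language is an extra input, and the paper's method further constrains this search using language information; the skeleton of the decoding loop is the same.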

Point Cloud Models Improve Visual Robustness in Robotic Learners

no code implementations29 Apr 2024 Skand Peri, Iain Lee, Chanho Kim, Li Fuxin, Tucker Hermans, Stefan Lee

In this work, we examine robustness to a suite of these types of visual changes for RGB-D and point cloud based visual control policies.

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

no code implementations CVPR 2024 Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle

Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset.

Fairness

Viewpoint-Aware Visual Grounding in 3D Scenes

no code implementations CVPR 2024 Xiangxi Shi, Zhonghua Wu, Stefan Lee

In this paper we investigate the significance of viewpoint information in 3D visual grounding -- introducing a model that explicitly predicts the speaker's viewpoint based on the referring expression and scene.

3D visual grounding, Referring Expression

VLSlice: Interactive Vision-and-Language Slice Discovery

1 code implementation ICCV 2023 Eric Slyman, Minsuk Kahng, Stefan Lee

Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks.

Slice Discovery

Behavioral Analysis of Vision-and-Language Navigation Agents

1 code implementation CVPR 2023 Zijiao Yang, Arjun Majumdar, Stefan Lee

To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings.

Vision and Language Navigation

Emergence of Maps in the Memories of Blind Navigation Agents

no code implementations30 Jan 2023 Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra

A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial.

Inductive Bias, PointGoal Navigation

Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances

no code implementations29 Nov 2022 Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, Devendra Singh Chaplot

We consider the problem of embodied visual navigation given an image-goal (ImageNav) where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image.

Visual Navigation

Iterative Vision-and-Language Navigation

no code implementations CVPR 2023 Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason

We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time.

Instruction Following, Vision and Language Navigation

Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

no code implementations20 Apr 2022 Jacob Krantz, Stefan Lee

Recent work in Vision-and-Language Navigation (VLN) has presented two environmental paradigms with differing realism -- the standard VLN setting built on topological environments where navigation is abstracted away, and the VLN-CE setting where agents must navigate continuous 3D environments using low-level actions.

Navigate, Vision and Language Navigation

PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications

no code implementations19 Jan 2022 Drew Penney, Bin Li, Jaroslaw Sydir, Lizhong Chen, Charlie Tai, Stefan Lee, Eoin Walsh, Thomas Long

A growing number of service providers are exploring methods to improve server utilization and reduce power consumption by co-scheduling high-priority latency-critical workloads with best-effort workloads.

Scheduling

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

no code implementations NeurIPS 2021 Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra

Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location.

Object, Scene Classification +2

Waypoint Models for Instruction-guided Navigation in Continuous Environments

1 code implementation ICCV 2021 Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, Oleksandr Maksymets

Little inquiry has explicitly addressed the role of action spaces in language-guided visual navigation -- either in terms of its effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory.

Instruction Following, Visual Navigation

Piecewise-constant Neural ODEs

1 code implementation11 Jun 2021 Sam Greydanus, Stefan Lee, Alan Fern

Neural networks are a popular tool for modeling sequential data but they generally do not treat time as a continuous variable.

Deep Convolution for Irregularly Sampled Temporal Point Clouds

no code implementations1 May 2021 Erich Merrill, Stefan Lee, Li Fuxin, Thomas G. Dietterich, Alan Fern

We consider the problem of modeling the dynamics of continuous spatial-temporal processes represented by irregular samples through both space and time.

Starcraft, Starcraft II

THDA: Treasure Hunt Data Augmentation for Semantic Navigation

no code implementations ICCV 2021 Oleksandr Maksymets, Vincent Cartillier, Aaron Gokaslan, Erik Wijmans, Wojciech Galuba, Stefan Lee, Dhruv Batra

We show that this is a natural consequence of optimizing for the task metric (which in fact penalizes exploration), is enabled by powerful observation encoders, and is possible due to the finite set of training environment configurations.

Data Augmentation, Navigate +3

Jumpy Recurrent Neural Networks

no code implementations1 Jan 2021 Samuel James Greydanus, Stefan Lee, Alan Fern

This structure enables our model to jump over long time intervals while retaining the ability to produce fine-grained or continuous-time predictions when necessary.

Time Series, Time Series Analysis

Where Are You? Localization from Embodied Dialog

2 code implementations EMNLP 2020 Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.

Navigate, Visual Dialog

Sim-to-Real Transfer for Vision-and-Language Navigation

1 code implementation7 Nov 2020 Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee

We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.

Vision and Language Navigation

DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs

2 code implementations ICLR 2021 Aayam Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern

We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience.

Offline RL, reinforcement-learning +2
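The core idea in the entry above, optimally solving a finite MDP derived from logged experience, can be illustrated with tabular value iteration. The transition table and discount below are invented for illustration; they are not taken from the paper, which derives its MDP from a static experience dataset.

```python
# Hypothetical finite MDP: transitions[s][a] = (next_state, reward);
# state 3 is terminal (self-loops with zero reward).
transitions = {
    0: {0: (1, 0.0), 1: (2, 0.0)},
    1: {0: (3, 1.0), 1: (0, 0.0)},
    2: {0: (3, 0.0), 1: (3, 0.5)},
    3: {0: (3, 0.0), 1: (3, 0.0)},
}
GAMMA = 0.9

def value_iteration(transitions, gamma=GAMMA, tol=1e-8):
    """Sweep Bellman optimality backups in place until convergence."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            v_new = max(r + gamma * V[s2] for s2, r in actions.values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = value_iteration(transitions)
# Acting greedily with respect to V recovers the optimal policy.
policy = {}
for s, actions in transitions.items():
    policy[s] = max(actions, key=lambda a: actions[a][1] + GAMMA * V[actions[a][0]])
```

Because the derived MDP is finite, this solve is exact; the paper's contribution lies in how the MDP is constructed from the dataset, not in the solver itself.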

On the Sub-Layer Functionalities of Transformer Decoder

no code implementations Findings of the Association for Computational Linguistics 2020 Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role.

Decoder, de-en +3

Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

1 code implementation2 Oct 2020 Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra

We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?")

Decoder, Representation Learning

Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents

no code implementations7 Sep 2020 Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra, Devi Parikh

This enables a seamless adaption to changing dynamics (a different robot or floor type) by simply re-calibrating the visual odometry model -- circumventing the expense of re-training of the navigation policy.

Navigate, Robot Navigation +1

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments – Extended Abstract

no code implementations ICML Workshop LaReL 2020 Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.

Vision and Language Navigation

Extended Abstract: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

no code implementations ICML Workshop LaReL 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop near the sofa' requires an agent to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

1 code implementation ECCV 2020 Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

Vision and Language Navigation

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

4 code implementations ECCV 2020 Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions.

Vision and Language Navigation

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations CVPR 2020 Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval, Question Answering +3

Question-Conditioned Counterfactual Image Generation for VQA

no code implementations14 Nov 2019 Jingjing Pan, Yash Goyal, Stefan Lee

While Visual Question Answering (VQA) models continue to push the state-of-the-art forward, they largely remain black-boxes - failing to provide insight into how or why an answer is generated.

counterfactual, Image Generation +2

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

8 code implementations ICLR 2020 Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra

We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.

Autonomous Navigation, Navigate +3

Emergence of Compositional Language with Deep Generational Transmission

1 code implementation ICLR 2020 Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, Dhruv Batra

In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language.

Reinforcement Learning, Reinforcement Learning (RL)

Counterfactual Visual Explanations

1 code implementation16 Apr 2019 Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, Stefan Lee

In this work, we develop a technique to produce counterfactual visual explanations.

counterfactual, General Classification +1

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

no code implementations CVPR 2019 Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D).

Embodied Question Answering, Question Answering

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

no code implementations ICCV 2019 Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image.

Image Captioning, Question Answering +2

EvalAI: Towards Better Evaluation Systems for AI Agents

3 code implementations10 Feb 2019 Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, Dhruv Batra

We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale.

Benchmarking, BIG-bench Machine Learning

nocaps: novel object captioning at scale

2 code implementations ICCV 2019 Harsh Agrawal, Karan Desai, YuFei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task.

Image Captioning, Object +2

Neural Modular Control for Embodied Question Answering

2 code implementations26 Oct 2018 Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning.

Embodied Question Answering, Imitation Learning +4

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations1 Oct 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.

Question Generation, Question-Generation +1

Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance

1 code implementation ECCV 2018 Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee

Our approach, which we call Neuron Importance-Aware Weight Transfer (NIWT), learns to map domain knowledge about novel "unseen" classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts - essentially learning classifiers by discovering and composing learned semantic concepts in deep networks.

Generalized Zero-Shot Learning

Graph R-CNN for Scene Graph Generation

3 code implementations ECCV 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images.

Graph Generation, Scene Graph Generation

Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations

no code implementations ICML 2018 Ashwin Kalyan, Stefan Lee, Anitha Kannan, Dhruv Batra

Many structured prediction problems (particularly in vision and language domains) are ambiguous, with multiple outputs being correct for an input - e.g., there are many ways of describing an image, multiple ways of translating a sentence; however, exhaustively annotating the applicability of all possible outputs is intractable due to exponentially large output spaces (e.g., all English sentences).

Diversity, Multi-Label Classification +4

Embodied Question Answering

4 code implementations CVPR 2018 Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?").

Embodied Question Answering, Navigate +4

Natural Language Does Not Emerge `Naturally' in Multi-Agent Dialog

1 code implementation EMNLP 2017 Satwik Kottur, José Moura, Stefan Lee, Dhruv Batra

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, learned without any human supervision!

Slot Filling

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog

3 code implementations26 Jun 2017 Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision!

Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning

no code implementations CVPR 2017 Qing Sun, Stefan Lee, Dhruv Batra

We develop the first approximate inference algorithm for 1-Best (and M-Best) decoding in bidirectional neural sequence models by extending Beam Search (BS) to reason about both forward and backward time dependencies.

Image Captioning, Sentence

The Promise of Premise: Harnessing Question Premises in Visual Question Answering

1 code implementation EMNLP 2017 Aroma Mahendru, Viraj Prabhu, Akrit Mohapatra, Dhruv Batra, Stefan Lee

In this paper, we make a simple observation that questions about images often contain premises - objects and relationships implied by the question - and that reasoning about premises can help Visual Question Answering (VQA) models respond more intelligently to irrelevant or previously unseen questions.

Question Answering, Visual Question Answering

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

7 code implementations ICCV 2017 Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.

reinforcement-learning, Reinforcement Learning +3

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

25 code implementations7 Oct 2016 Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra

We observe that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.

Diversity, Image Captioning +5
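The mechanism behind the entry above, splitting the beam into groups and penalizing tokens already chosen by earlier groups at the same time step, can be sketched for a single decoding step. This is a simplified illustration with one hypothesis per group; `logprobs_fn` and the penalty weight `lam` are placeholder names, and the full method also keeps standard beam search within each group.

```python
import math

def diverse_step(logprobs_fn, group_prefixes, vocab, num_groups, lam=0.5):
    """One decoding step of diversity-augmented beam search (sketch):
    groups expand in order, and each token's score is reduced by lam times
    the number of earlier groups that already picked it this step."""
    counts = {tok: 0 for tok in vocab}  # tokens used by earlier groups
    choices = []
    for g in range(num_groups):
        prefix = group_prefixes[g]
        logprobs = logprobs_fn(prefix)
        scores = {tok: logprobs[tok] - lam * counts[tok] for tok in vocab}
        best = max(scores, key=scores.get)  # greedy within-group
        choices.append(prefix + [best])
        counts[best] += 1  # discourage reuse in later groups
    return choices
```

With a large enough penalty, later groups are pushed off the greedy token onto distinct continuations, which is the source of the decoded diversity.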

Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles

no code implementations NeurIPS 2016 Stefan Lee, Senthil Purushwalkam, Michael Cogswell, Viresh Ranjan, David Crandall, Dhruv Batra

Many practical perception systems exist within larger processes that include interactions with users or additional components capable of evaluating the quality of predicted solutions.

Multiple-choice

Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

no code implementations ICCV 2015 Sven Bambach, Stefan Lee, David J. Crandall, Chen Yu

Hands appear very often in egocentric video, and their appearance and pose give important cues about what people are doing and what they are paying attention to.

Hand Detection, Hand Segmentation
