Search Results for author: Jesse Thomason

Found 49 papers, 21 papers with code

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

7 code implementations CVPR 2020 Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.

Natural Language Visual Grounding

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

1 code implementation ACL 2022 Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, Xin Eric Wang

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks.

Vision and Language Navigation

TEACh: Task-driven Embodied Agents that Chat

3 code implementations 1 Oct 2021 Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, Dilek Hakkani-Tur

Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes.

Dialogue Understanding

Vision-and-Dialog Navigation

2 code implementations 10 Jul 2019 Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer

To train agents that search an environment for a goal location, we define the Navigation from Dialog History task.

Visual Navigation

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

1 code implementation 18 Jun 2022 Tejas Srinivasan, Ting-Yun Chang, Leticia Leonor Pinto Alva, Georgios Chochlakis, Mohammad Rostami, Jesse Thomason

Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks.

Continual Learning • Transfer Learning

I2I: Initializing Adapters with Improvised Knowledge

1 code implementation 4 Apr 2023 Tejas Srinivasan, Furong Jia, Mohammad Rostami, Jesse Thomason

We propose Improvise to Initialize (I2I), a continual learning algorithm that initializes Adapters for incoming tasks by distilling knowledge from previously-learned tasks' Adapters.

Continual Learning • Question Answering +2

LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

1 code implementation 10 Nov 2021 Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, Gaurav S. Sukhatme

However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts.

Indoor Scene Synthesis • Scene Generation

Language Grounding with 3D Objects

2 code implementations 26 Jul 2021 Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer

We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these image-based models are weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation.

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

1 code implementation 13 Feb 2024 Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions.

Robot Manipulation • Generalization

VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media

1 code implementation 18 Aug 2022 Georgios Chochlakis, Tejas Srinivasan, Jesse Thomason, Shrikanth Narayanan

VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency.

Descriptive • Image Captioning +4

RMM: A Recursive Mental Model for Dialog Navigation

1 code implementation 2 May 2020 Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers.

Answer Generation • Instruction Following

RMM: A Recursive Mental Model for Dialogue Navigation

1 code implementation Findings of the Association for Computational Linguistics 2020 Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers.

Answer Generation • Instruction Following

Multimodal Speech Recognition for Language-Guided Embodied Agents

1 code implementation 27 Feb 2023 Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Improving Sign Recognition with Phonology

2 code implementations 11 Feb 2023 Lee Kezar, Jesse Thomason, Zed Sevcikova Sehyr

We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding.

Sign Language Recognition

Exploring Strategies for Modeling Sign Language Phonology

1 code implementation 30 Sep 2023 Lee Kezar, Riley Carlin, Tejas Srinivasan, Zed Sehyr, Naomi Caselli, Jesse Thomason

Specifically, we explore how learning strategies like multi-task and curriculum learning can leverage mutually useful information between phoneme types to facilitate better modeling of sign language phonemes.

Interpreting Black Box Models via Hypothesis Testing

1 code implementation 29 Mar 2019 Collin Burns, Jesse Thomason, Wesley Tansey

In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments.

Two-sample testing

Experience Grounds Language

2 code implementations EMNLP 2020 Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian

Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates.

Representation Learning

The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes

1 code implementation 30 Sep 2023 Lee Kezar, Elana Pontecorvo, Adele Daniels, Connor Baer, Ruth Ferster, Lauren Berger, Jesse Thomason, Zed Sevcikova Sehyr, Naomi Caselli

Sign language recognition and translation technologies have the potential to increase access and inclusion of deaf signing communities, but research progress is bottlenecked by a lack of representative data.

Fairness • Sign Language Recognition

Improving Robot Success Detection using Static Object Data

1 code implementation 2 Apr 2019 Rosario Scalise, Jesse Thomason, Yonatan Bisk, Siddhartha Srinivasa

We collect over 13 hours of egocentric manipulation data for training a model to reason about whether a robot successfully placed unseen objects in or on one another.

Object

Interpretable Low-Dimensional Regression via Data-Adaptive Smoothing

no code implementations 6 Aug 2017 Wesley Tansey, Jesse Thomason, James G. Scott

We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance.

Additive models • Denoising +1

Shifting the Baseline: Single Modality Performance on Visual Navigation & QA

no code implementations 1 Nov 2018 Jesse Thomason, Daniel Gordon, Yonatan Bisk

We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research.

Question Answering • Visual Navigation

Prospection: Interpretable Plans From Language By Predicting the Future

no code implementations 20 Mar 2019 Chris Paxton, Yonatan Bisk, Jesse Thomason, Arunkumar Byravan, Dieter Fox

High-level human instructions often correspond to behaviors with multiple implicit steps.

Shifting the Baseline: Single Modality Performance on Visual Navigation & QA

no code implementations NAACL 2019 Jesse Thomason, Daniel Gordon, Yonatan Bisk

We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research.

Visual Navigation

The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation

no code implementations 23 Oct 2020 Shurjo Banerjee, Jesse Thomason, Jason J. Corso

In each trial, the pair first cooperates to localize the robot on a global map visible to the Commander, then the Driver follows Commander instructions to move the robot to a sequence of target objects.

Navigate • Simultaneous Localization and Mapping

Interactive Learning from Natural Language and Demonstrations using Signal Temporal Logic

no code implementations 1 Jul 2022 Sara Mohammadinejad, Jesse Thomason, Jyotirmoy V. Deshmukh

In this work, we propose DIALOGUESTL, an interactive approach for learning correct and concise STL formulas from (often) ambiguous NL descriptions.

Formal Logic • Q-Learning +2

Curriculum Learning for Data-Efficient Vision-Language Alignment

no code implementations 29 Jul 2022 Tejas Srinivasan, Xiang Ren, Jesse Thomason

Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data.

Contrastive Learning • Image Retrieval +3

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

no code implementations 22 Sep 2022 Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information.

Iterative Vision-and-Language Navigation

no code implementations CVPR 2023 Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason

We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time.

Instruction Following • Vision and Language Navigation

Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems

no code implementations 26 Oct 2022 Wang Zhu, Jesse Thomason, Robin Jia

For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance.

Question Answering • Visual Question Answering

CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation

no code implementations 30 Nov 2022 Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, Gaurav S. Sukhatme

Our results on the coarse-grained instruction following task of REVERIE demonstrate the navigational capability of CLIP, surpassing the supervised baseline in terms of both success rate (SR) and success weighted by path length (SPL).

Instruction Following • Object Recognition +1

RREx-BoT: Remote Referring Expressions with a Bag of Tricks

no code implementations 30 Jan 2023 Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson Piramuthu

Armed with this intuition, using only a generic vision-language scoring model with minor modifications for 3D encoding and operating in an embodied environment, we demonstrate an absolute performance gain of 9.84% on remote object grounding above state-of-the-art models for REVERIE and of 5.04% on FAO.

Object • Object Localization

Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation

no code implementations 25 Mar 2023 Yuliang Cai, Jesse Thomason, Mohammad Rostami

The size and the computational load of fine-tuning large-scale pre-trained neural networks are becoming two major obstacles in adopting machine learning in many applications.

Continual Learning • Knowledge Distillation +1

Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

no code implementations 24 May 2023 Wang Zhu, Jesse Thomason, Robin Jia

We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions.

Language Modelling • Question Answering

Comparative Multi-View Language Grounding

no code implementations 12 Nov 2023 Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

In this work, we consider the task of resolving object referents when given a comparative language description.

Object

Do Localization Methods Actually Localize Memorized Data in LLMs?

no code implementations 15 Nov 2023 Ting-Yun Chang, Jesse Thomason, Robin Jia

Large language models (LLMs) can memorize many pretrained sequences verbatim.

Benchmarking

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

no code implementations 28 Nov 2023 Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason

Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy.

Data Augmentation • Translation +1

WinoViz: Probing Visual Properties of Objects Under Different States

no code implementations 21 Feb 2024 Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren

We present WinoViz, a text-only evaluation dataset, consisting of 1,380 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states.

Language Modelling

Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

no code implementations 23 Feb 2024 Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu

Prior work on selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain.

ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

no code implementations 16 Mar 2024 Anthony Liang, Jesse Thomason, Erdem Biyik

Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks, including the DeepMind Control benchmark and robot manipulation both in simulation and on a real robot.

reinforcement-learning • Reinforcement Learning (RL) +1

TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models

no code implementations 25 Mar 2024 Ishika Singh, David Traum, Jesse Thomason

We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone and preserving execution success.
