1 code implementation • 10 Jul 2024 • Mohit Shridhar, Yat Long Lo, Stephen James
We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images.
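To make the "draw joint-actions" idea concrete, here is a minimal, loosely related sketch using the diffusers library: an image-to-image Stable Diffusion pipeline is conditioned on the current RGB observation and a task instruction, and the drawn targets would then be decoded into joint positions by a separate controller. The pipeline choice, prompt, file paths, and the decoder stub are assumptions for illustration, not GENIMA's released implementation.

```python
# Minimal sketch of the concept (not the released GENIMA code): draw target
# markers onto the RGB observation with an image-to-image diffusion model,
# then decode the drawn targets into joint-actions with a separate controller.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

def decode_targets_to_joints(drawn_image):
    """Hypothetical stand-in for the learned controller that reads the drawn
    targets and outputs joint positions."""
    raise NotImplementedError

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

obs = Image.open("rgb_observation.png").convert("RGB")  # placeholder camera frame
prompt = "open the top drawer"                           # placeholder instruction

# Keep strength low so the original scene is preserved and only targets are drawn.
drawn = pipe(prompt=prompt, image=obs, strength=0.4, guidance_scale=7.5).images[0]

joint_action = decode_targets_to_joints(drawn)
```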
Ranked #1 on Robot Manipulation Generalization on The COLOSSEUM
no code implementations • 29 Jun 2024 • Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
To kickstart the benchmark, we extend several state-of-the-art methods to bimanual manipulation and present PerAct2, a language-conditioned behavioral cloning agent that enables the learning and execution of bimanual 6-DoF manipulation tasks.
1 code implementation • 2 Oct 2023 • Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang
Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data.
no code implementations • 23 Jun 2023 • Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, Ranjay Krishna
By contrast, we introduce AR2-D2: a system for collecting demonstrations that (1) does not require people with specialized training, (2) does not require any real robots during data collection, and therefore (3) enables manipulation of diverse objects with a real robot.
no code implementations • 13 Oct 2022 • Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, Sonia Raychaudhuri, Mike Roberts, Silvio Savarese, Manolis Savva, Mohit Shridhar, Niko Sünderhauf, Andrew Szot, Ben Talbot, Joshua B. Tenenbaum, Jesse Thomason, Alexander Toshev, Joanne Truong, Luca Weihs, Jiajun Wu
We present a retrospective on the state of Embodied AI research.
1 code implementation • 12 Sep 2022 • Mohit Shridhar, Lucas Manuelli, Dieter Fox
With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task.
Ranked #5 on Robot Manipulation on RLBench
1 code implementation • 24 Sep 2021 • Mohit Shridhar, Lucas Manuelli, Dieter Fox
We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better than or comparable to single-task policies.
2 code implementations • 26 Jul 2021 • Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer
We introduce several CLIP-based models for distinguishing objects and show that, while recent advances in jointly modeling vision and language are useful for robotic language understanding, these image-based models remain weaker at understanding the 3D nature of objects, properties that play a key role in manipulation.
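As an illustration of the kind of CLIP-based object-distinguishing model the excerpt refers to, the following sketch scores cropped object images against a language description with off-the-shelf CLIP. The model variant, file paths, and description are assumptions, not the paper's exact models.

```python
# Illustrative sketch (assumed details, not the paper's models): rank candidate
# object crops by their CLIP similarity to a referring description.
import torch
import clip                      # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

description = "the tall blue mug with a chipped handle"          # placeholder query
crop_paths = ["object_0.png", "object_1.png", "object_2.png"]    # placeholder crops

text = clip.tokenize([description]).to(device)
images = torch.stack([preprocess(Image.open(p)) for p in crop_paths]).to(device)

with torch.no_grad():
    image_feats = model.encode_image(images)
    text_feats = model.encode_text(text)
    image_feats /= image_feats.norm(dim=-1, keepdim=True)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feats.T).squeeze(-1)   # cosine similarity per crop

best = scores.argmax().item()    # index of the crop best matching the description
```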
2 code implementations • 8 Oct 2020 • Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht
ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions.
11 code implementations • CVPR 2020 • Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
no code implementations • 11 Jun 2018 • Mohit Shridhar, David Hsu
The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects.
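A minimal sketch of the first stage described above, under assumptions: the caption generator is a hypothetical stub for the paper's description-generating network, and a sentence-embedding model stands in for whatever comparison the paper actually used. Captions for each detected object are compared with the input expression, and the top-scoring objects are kept as candidates.

```python
# Illustrative sketch of the described first stage (helper names and the
# sentence-embedding comparison are assumptions, not the paper's implementation):
# caption each detected object, compare captions with the input expression,
# and keep the top-scoring candidates.
from sentence_transformers import SentenceTransformer, util

def generate_caption(object_crop):
    """Hypothetical stand-in for the network that generates visual descriptions."""
    raise NotImplementedError

def first_stage(object_crops, expression, top_k=3):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    captions = [generate_caption(crop) for crop in object_crops]
    cap_emb = encoder.encode(captions, convert_to_tensor=True)
    expr_emb = encoder.encode(expression, convert_to_tensor=True)
    scores = util.cos_sim(expr_emb, cap_emb)[0]           # similarity to each caption
    ranked = scores.argsort(descending=True)[:top_k]
    return [object_crops[int(i)] for i in ranked]          # candidate objects
```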
no code implementations • 18 Jul 2017 • Mohit Shridhar, David Hsu
A core issue for the system is semantic and spatial grounding, which is to infer objects and their spatial relationships from images and natural language expressions.