We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations.
Standard model-based reinforcement learning (MBRL) approaches fit a transition model of the environment to all past experience, but this wastes model capacity on data that is irrelevant for policy improvement.
Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds.
We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration.
We address key challenges in long-horizon embodied exploration and navigation by proposing a new object transport task and a novel modular framework for temporally extended navigation.
Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question.
Previous studies in the perimeter defense game have largely focused on the fully observable setting where the true player states are known to all players.
Across applications spanning supervised classification and sequential control, deep learning has been reported to find "shortcut" solutions that fail catastrophically under minor changes in the data distribution.
Offline goal-conditioned reinforcement learning (GCRL) promises general-purpose skill learning in the form of reaching diverse goals from purely offline datasets.
We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile regression-based offline imitation learning (IL) algorithm derived via state-occupancy matching.
no code implementations • 19 Jan 2022 • Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J. How, Justus M Kebschull, John W. Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Yezerets, Dinesh Jayaraman, Jong M. Shin, Soledad Villar, Ian Phillips, Carey E. Priebe, Thomas Hartung, Michael I. Miller, Jayanta Dey, Ningyuan, Huang, Eric Eaton, Ralph Etienne-Cummings, Elizabeth L. Ogburn, Randal Burns, Onyema Osuagwu, Brett Mensh, Alysson R. Muotri, Julia Brown, Chris White, Weiwei Yang, Andrei A. Rusu, Timothy Verstynen, Konrad P. Kording, Pratik Chaudhari, Joshua T. Vogelstein
We conjecture that certain sequences of tasks are not retrospectively learnable (in which the data distribution is fixed), but are prospectively learnable (in which distributions may be dynamic), suggesting that prospective learning is more difficult in kind than retrospective learning.
Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment.
When operating in partially observed settings, it is important for a control policy to fuse information from a history of observations.
This paper focuses on the problem of 3D human reconstruction from 2D evidence.
Ranked #59 on 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)
Training visual control policies from scratch on a new robot typically requires generating large amounts of robot-specific data.
We prove that CODAC learns a conservative return distribution -- in particular, for finite MDPs, CODAC converges to an uniform lower bound on the quantiles of the return distribution; our proof relies on a novel analysis of the distributional Bellman operator.
The difficulty of optimal control problems has classically been characterized in terms of system properties such as minimum eigenvalues of controllability/observability gramians.
We propose Likelihood-Based Diverse Sampling (LDS), a method for improving the quality and the diversity of trajectory samples from a pre-trained flow model.
Imitation learning trains policies to map from input observations to the actions that an expert would choose.
Scaling model-based inverse reinforcement learning (IRL) to real robotic manipulation tasks with unknown dynamics remains an open problem.
Reinforcement learning (RL) in real-world safety-critical target settings like urban driving is hazardous, imperiling the RL agent, other agents, and the environment.
In this work we propose a framework for visual prediction and planning that is able to overcome both of these limitations.
no code implementations • 29 May 2020 • Mike Lambeta, Po-Wei Chou, Stephen Tian, Brian Yang, Benjamin Maloon, Victoria Rose Most, Dave Stroud, Raymond Santos, Ahmad Byagowi, Gregg Kammerer, Dinesh Jayaraman, Roberto Calandra
Despite decades of research, general purpose in-hand manipulation remains one of the unsolved challenges of robotics.
Existing approaches for visuomotor robotic control typically require characterizing the robot in advance by calibrating the camera or performing system identification.
Every living organism struggles against disruptive environmental forces to carve out and maintain an orderly niche.
All living organisms struggle against the forces of nature to carve out niches where they can maintain relative stasis.
We study the problem of safe adaptation: given a model trained on a variety of past experiences for some task, can this model learn to perform that task in a new situation while avoiding catastrophic failure?
Prior work on video generation largely focuses on prediction models that only observe frames from the beginning of the video.
Standard computer vision systems assume access to intelligently captured inputs (e. g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself.
Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment.
We envision REPLAB as a framework for reproducible research across manipulation tasks, and as a step in this direction, we define a template for a grasping benchmark consisting of a task definition, evaluation protocol, performance measures, and a dataset of 92k grasp attempts.
Touch sensing is widely acknowledged to be important for dexterous robotic manipulation, but exploiting tactile sensing for continuous, non-prehensile manipulation is challenging.
This model -- a deep, multimodal convolutional network -- predicts the outcome of a candidate grasp adjustment, and then executes a grasp by iteratively selecting the most promising actions.
We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation.
It is common to implicitly assume access to intelligently captured inputs (e. g., photos from a human photographer), yet autonomously capturing good observations is itself a major challenge.
AutoCam leverages NFOV web video to discriminatively identify space-time "glimpses" of interest at each time instant, and then uses dynamic programming to select optimal human-like camera trajectories.
Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances.
To verify this hypothesis, we attempt to induce this capacity in our active recognition pipeline, by simultaneously learning to forecast the effects of the agent's motions on its internal representation of the environment conditional on all past views.
While this standard approach captures the fact that high-level visual signals change slowly over time, it fails to capture *how* the visual content changes.
Understanding how images of objects and scenes behave in response to specific ego-motions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images.
Existing methods to learn visual attributes are prone to learning the wrong thing---namely, properties that are correlated with the attribute of interest among training samples.