We benchmark the performance of offline RL and IL algorithms on our assembly tasks and demonstrate that such algorithms must be substantially improved to solve our tasks in the real world, providing ample opportunities for future research.
Then, we train a high-level module to comprehend the task specification (e.g., input/output pairs or demonstrations) from long programs and produce a sequence of task embeddings, which are then decoded by the program decoder and composed to yield the synthesized program.
The ability to leverage shared behaviors between tasks is critical for sample-efficient multi-task reinforcement learning (MTRL).
We propose an approach for semantic imitation, which uses demonstrations from a source domain, e.g., human videos, to accelerate reinforcement learning (RL) in a different target domain, e.g., a robotic manipulator in a simulated kitchen.
Large-scale data is an essential component of machine learning as demonstrated in recent advances in natural language processing and computer vision research.
From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes rather than predicting every low-level detail of the intermediate states step by step.
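A minimal sketch of what planning with a skill dynamics model can look like is shown below; the network sizes, skill horizon, reward head, and random-shooting planner are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: plan over whole skills with a learned skill dynamics
# model, instead of rolling out low-level dynamics step by step.
import torch
import torch.nn as nn

class SkillDynamics(nn.Module):
    """Predicts the latent state reached after executing one whole skill."""
    def __init__(self, state_dim=64, skill_dim=10, hidden=256):
        super().__init__()
        self.next_state = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        self.reward = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, skill):
        x = torch.cat([z, skill], dim=-1)
        return self.next_state(x), self.reward(x)

def plan_skills(model, z0, horizon=5, candidates=512, skill_dim=10):
    """Random-shooting MPC over skill sequences (one latent jump per skill)."""
    skills = torch.randn(candidates, horizon, skill_dim)
    z = z0.expand(candidates, -1)            # z0: (1, state_dim)
    total = torch.zeros(candidates)
    for t in range(horizon):
        z, r = model(z, skills[:, t])
        total += r.squeeze(-1)
    best = total.argmax()
    return skills[best, 0]                   # execute the first skill of the best plan
```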
We investigate the effectiveness of unsupervised and task-induced representation learning approaches on four visually complex environments, from Distracting DMControl to the CARLA driving simulator.
Task progress is intuitive and readily available task information that can guide an agent closer to the desired goal.
However, these approaches must cover increasingly large state distributions as more policies are sequenced, and thus are limited to short skill sequences.
Learning complex manipulation tasks in realistic, obstructed environments is a challenging problem due to hard exploration in the presence of obstacles and high-dimensional visual observations.
To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task.
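The sketch below illustrates one standard way to search a learned program embedding space: cross-entropy-method optimization over latents, decoding each candidate and scoring its return. The `decode_program` and `evaluate_return` callables and the latent dimensionality are placeholders for illustration, not the actual interface.

```python
# Illustrative CEM search over a learned program embedding space.
import numpy as np

def cem_program_search(decode_program, evaluate_return, latent_dim=64,
                       iterations=20, population=64, elites=8):
    mean, std = np.zeros(latent_dim), np.ones(latent_dim)
    for _ in range(iterations):
        z = mean + std * np.random.randn(population, latent_dim)
        returns = np.array([evaluate_return(decode_program(zi)) for zi in z])
        elite = z[np.argsort(returns)[-elites:]]           # keep the best latents
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return decode_program(mean)                            # final synthesized program
```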
Prior approaches for demonstration-guided RL treat every new task as an independent learning problem and attempt to follow the provided demonstrations step-by-step, akin to a human trying to imitate a completely unseen behavior by following the demonstrator's exact muscle movements.
In this paper, we propose a novel policy transfer method with iterative "environment grounding", IDAPT, that alternates between (1) directly minimizing both visual and dynamics domain gaps by grounding the source environment in the target environment domains, and (2) training a policy on the grounded source environment.
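A rough sketch of this alternating structure, with the grounding and policy-training subroutines passed in as placeholders (they are assumptions for illustration, not the method's actual components):

```python
# Hedged sketch of alternating (1) environment grounding and (2) policy training.
def iterative_policy_transfer(ground_environment, train_policy,
                              source_env, target_env, init_policy, iterations=3):
    policy = init_policy
    for _ in range(iterations):
        # (1) Ground the source environment in the target domain: reduce the
        #     visual and dynamics gaps using rollouts of the current policy.
        grounded_env = ground_environment(source_env, target_env, policy)
        # (2) Train the policy entirely inside the grounded source environment.
        policy = train_policy(grounded_env, policy)
    return policy
```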
Active learning is widely used to reduce labeling effort and training time by repeatedly querying only the most beneficial samples from unlabeled data.
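As a concrete (generic) instance of this query-and-retrain cycle, the sketch below uses least-confidence uncertainty sampling; the model, unlabeled pool, and labeling oracle are placeholders rather than any specific system.

```python
# Generic uncertainty-sampling active learning loop.
import numpy as np

def active_learning_loop(model, unlabeled_pool, oracle, rounds=10, batch=32):
    labeled_x, labeled_y = [], []
    for _ in range(rounds):
        probs = model.predict_proba(unlabeled_pool)        # (N, num_classes)
        uncertainty = 1.0 - probs.max(axis=1)              # least-confidence score
        query_idx = np.argsort(uncertainty)[-batch:]       # most beneficial samples
        labeled_x.extend(unlabeled_pool[query_idx])
        labeled_y.extend(oracle(unlabeled_pool[query_idx]))
        unlabeled_pool = np.delete(unlabeled_pool, query_idx, axis=0)
        model.fit(np.array(labeled_x), np.array(labeled_y))
    return model
```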
A fundamental trait of intelligence is the ability to achieve goals in the face of novel circumstances, such as making decisions from new action choices.
We validate our approach, SPiRL (Skill-Prior RL), on complex navigation and robotic manipulation tasks and show that learned skill priors are essential for effective skill transfer from rich datasets.
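In the spirit of the description above, a skill-prior-regularized policy update can replace the usual max-entropy term with a KL term toward the learned prior, guiding exploration toward skills seen in the offline data. The distribution types, critic interface, and temperature below are assumptions for illustration.

```python
# Sketch of a prior-regularized policy loss over skills.
import torch
import torch.distributions as D

def policy_loss(state, policy_dist: D.Normal, prior_dist: D.Normal,
                critic, alpha: float = 0.1):
    skill = policy_dist.rsample()                           # reparameterized skill sample
    kl = D.kl_divergence(policy_dist, prior_dist).sum(-1)   # stay close to the skill prior
    return (alpha * kl - critic(state, skill)).mean()       # ...while maximizing the critic's value
```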
In contrast, motion planners use explicit models of the agent and environment to plan collision-free paths to faraway goals, but suffer from inaccurate models in tasks that require contacts with the environment.
When mastering a complex manipulation task, humans often decompose the task into sub-skills for different body parts, practice the sub-skills independently, and then execute the sub-skills together.
The IKEA Furniture Assembly Environment is one of the first benchmarks for testing and accelerating the automation of complex manipulation tasks.
Model-agnostic meta-learners aim to acquire meta-learned parameters from similar tasks to adapt to novel tasks from the same distribution with few gradient updates.
To flexibly and efficiently reason about temporal sequences, abstract representations that compactly represent the important information in the sequence are needed.
Hence, we propose a framework to enable generalization over both these aspects: understanding an action’s functionality, and using actions to solve tasks through reinforcement learning.
Intelligent creatures acquire complex skills by exploiting previously learned skills and learning to transition between them.
A noisy and diverse demonstration set may hinder the performance of an agent aiming to acquire certain skills via imitation learning.
One important limitation of such frameworks is that they seek a common initialization shared across the entire task distribution, substantially limiting the diversity of the task distributions that they are able to learn from.
We address the task of multi-view novel view synthesis, where we are interested in synthesizing a target image with an arbitrary camera pose from given source images.
We complete unseen tasks by choosing new sequences of skill latents to control the robot using MPC, where our MPC model is composed of the pre-trained skill policy executed in the simulation environment, run in parallel with the real robot.
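A minimal sketch of this kind of simulator-in-the-loop MPC is shown below: candidate sequences of skill latents are rolled out in a simulator copy with the pre-trained skill policy, scored, and the first latent of the best plan is executed. The simulator interface, skill execution length, and cost function are assumptions for illustration.

```python
# Illustrative MPC over skill latents, using the simulator + pre-trained skill
# policy as the planning model (random shooting).
import copy
import numpy as np

def skill_mpc_step(sim_env, real_state, skill_policy, cost_fn,
                   horizon=4, candidates=128, skill_dim=8, steps_per_skill=10):
    best_cost, best_plan = np.inf, None
    for _ in range(candidates):
        plan = np.random.randn(horizon, skill_dim)        # candidate skill-latent sequence
        sim = copy.deepcopy(sim_env)
        sim.set_state(real_state)                         # sync simulator to the real robot
        state, cost = real_state, 0.0
        for z in plan:
            for _ in range(steps_per_skill):              # run one skill for a few low-level steps
                action = skill_policy(state, z)
                state = sim.step(action)
            cost += cost_fn(state)
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan[0]                                   # first skill latent to execute on the robot
```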
Personal robots assisting humans must perform complex manipulation tasks that are typically difficult to specify in traditional motion planning pipelines, where multiple objectives must be met and high-level context must be taken into consideration.
In this paper, we augment MAML with the capability to identify tasks sampled from a multimodal task distribution and adapt quickly through gradient updates.
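The sketch below shows one plausible reading of this idea: a task encoder infers the task mode from a few support examples, a modulation network adjusts the meta-learned initialization accordingly, and standard few-step gradient adaptation follows. All module names and the adaptation hyperparameters are hypothetical.

```python
# Rough sketch: task identification + modulation before gradient-based adaptation.
import torch

def adapt(meta_params, task_encoder, modulation_net, support_x, support_y,
          loss_fn, inner_lr=0.01, inner_steps=5):
    tau = task_encoder(support_x, support_y)             # infer the task embedding (mode)
    params = modulation_net(meta_params, tau)            # modulate the shared initialization
    for _ in range(inner_steps):                         # usual few-gradient-step adaptation
        loss = loss_fn(params, support_x, support_y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [p - inner_lr * g for p, g in zip(params, grads)]
    return params
```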
In particular, we first use simulation to jointly learn a policy for a set of low-level skills, and a "skill embedding" parameterization which can be used to compose them.
Watching expert demonstrations is an important way for humans and robots to reason about affordances of unseen objects.
We seek to understand the arrow of time in videos -- what makes videos look like they are playing forwards or backwards?
3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from heatmaps using knowledge learned from synthetic 3D shapes.
Recent state-of-the-art reinforcement learning algorithms are trained with the goal of excelling at one specific task.
We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it.
To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
In this work, we propose 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data.
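A conceptual sketch of such a two-stage pipeline, estimating 2D keypoint heatmaps first and then predicting 3D structure from them, is given below; the layer sizes, input resolution, and structure parameterization are assumptions, not the network's actual architecture.

```python
# Conceptual two-stage pipeline: image -> 2D keypoint heatmaps -> 3D structure.
import torch
import torch.nn as nn

class TwoStageInterpreter(nn.Module):
    def __init__(self, num_keypoints=16, structure_dim=30, resolution=64):
        super().__init__()
        self.heatmap_net = nn.Sequential(                  # image -> keypoint heatmaps
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 3, padding=1),
        )
        self.structure_net = nn.Sequential(                # heatmaps -> 3D structure parameters
            nn.Flatten(),
            nn.Linear(num_keypoints * resolution * resolution, 256), nn.ReLU(),
            nn.Linear(256, structure_dim),
        )

    def forward(self, image):                              # image: (B, 3, 64, 64)
        heatmaps = self.heatmap_net(image)
        structure = self.structure_net(heatmaps)
        return heatmaps, structure
```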
Humans demonstrate remarkable abilities to predict physical events in dynamic scenes, and to infer the physical properties of objects from static images.
Our system works by generalizing across object classes: states and transformations learned on one set of objects are used to interpret the image collection for an entirely new object class.
In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels - it is so much more.
Our features, called sketch tokens, are learned using supervised mid-level information in the form of hand drawn contours in images.