But most importantly, we are able to implement an exploration policy on a robot which learns to interact with objects completely from scratch just using data collected via the differentiable exploration module.
We observe a wide variety of drastically diverse locomotion styles across morphologies as well as centralized coordination emerging via message passing between decentralized modules purely from the reinforcement learning objective.
In this work, we leverage recent advances in rapid adaptation for locomotion control, and extend them to work on bipedal robots.
The 3D shapes are generated implicitly as deformations to a category-specific signed distance field and are learned in an unsupervised manner solely from unaligned image collections and their poses without any 3D supervision.
We show the proposed curriculum suffices to break the reconstruction-segmentation trade-off, and slow inference greatly improves segmentation in out-of-distribution scenes.
Human hands and robot hands differ in shape, size, and joint structure, and performing this translation from a single uncalibrated camera is a highly underconstrained problem.
We interpolate between the source robot and the target robot by finding a continuous evolutionary change of robot parameters.
The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data and abundant unlabeled samples per time period for continual semi-supervised learning.
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis and show that it improves the sample efficiency of both state-based and image-based RL.
A safety advisor module adds sensed unexpected obstacles to the occupancy map and environment-determined speed limits to the velocity command generator.
In this setup, the agent first learns to explore across many environments without any extrinsic goal in a task-agnostic manner.
We show that a single generalist policy can perform in-hand manipulation of over 100 geometrically-diverse real-world objects and generalize to new objects with unseen shape or size.
An alternative but important component to improve is the interface between the RL algorithm and the robot.
We demonstrate that learning to minimize energy consumption plays a key role in the emergence of natural locomotion gaits at different speeds in real quadruped robots.
Reward signals in reinforcement learning can be expensive in many tasks and often require direct access to the state.
Finally, we provide an empirical analysis and recommend general recipes for efficient transfer learning of vision and language models.
Successful real-world deployment of legged robots would require them to adapt in real time to unseen scenarios such as changing terrains, changing payloads, and wear and tear.
How can an artificial agent learn to solve a wide range of tasks in a complex visual environment in the absence of external supervision?
Policies trained in simulation often fail when transferred to the real world due to the 'reality gap', where the simulator is unable to accurately capture the dynamics and visual properties of the real world.
We present Worldsheet, a method for novel view synthesis using just a single RGB image as input.
A majority of methods for video frame interpolation compute bidirectional optical flow between adjacent frames of a video, followed by a suitable warping algorithm to generate the output frames.
We show that NDPs outperform the prior state-of-the-art in terms of either efficiency or performance across several robotic control tasks for both imitation and reinforcement learning setups.
Learning long-term dynamics models is the key to understanding physical common sense.
We observe that a wide variety of drastically diverse locomotion styles across morphologies, as well as centralized coordination, emerge via message passing between decentralized modules purely from the reinforcement learning objective.
For tasks such as image completion, these models are unable to use much of the observed context.
Reinforcement learning allows solving complex tasks; however, the learning tends to be task-specific and sample efficiency remains a challenge.
Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn.
To operate effectively in the real world, agents should be able to act from high-dimensional raw sensory input such as images and achieve diverse goals across long time-horizons.
We study a generalized setup for learning from demonstration to build an agent that can manipulate novel objects in unseen scenarios by looking at only a single video of human demonstration from a third-person perspective.
Instead of direct manual supervision, which is tedious and prone to bias, our goal in this work is to extract reusable skills from a collection of human demonstrations collected directly for several end-tasks.
In this paper, we propose a formulation for exploration inspired by work in the active learning literature.
Generative Adversarial Networks (GANs) can produce images of surprising complexity and realism but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene.
However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent.
Generative Adversarial Networks (GANs) can produce images of remarkable complexity and realism but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene.
The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels.
In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference.
Our proposed method encourages bijective consistency between the latent encoding and output modes.
In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether.
Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature.
In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
We present a regression framework which models the output distribution of neural networks.
We propose Constrained CNN (CCNN), a method which uses a novel loss function to optimize for any set of linear constraints on the output space (i.e., predicted label distribution) of a CNN.
We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network.
We develop methods for detector learning which exploit joint training over both weak and strong labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks.