We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc.
On semi-supervised learning benchmarks we improve performance significantly when only 1% of ImageNet labels are available, from 53.8% to 56.5%.
Despite these strong priors, we show that deep trackers often default to tracking by saliency detection -- without relying on the object instance representation.
We introduce a self-supervised representation learning method based on the task of temporal alignment between videos.
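The abstract above does not specify its alignment objective, so as an illustration only: one classical way to align two videos is dynamic time warping (DTW) over per-frame embeddings. The sketch below is not the paper's method; `dtw_align` is a hypothetical helper, and the embeddings are assumed to be precomputed NumPy arrays.

```python
import numpy as np

def dtw_align(x, y):
    """Illustrative DTW alignment between two embedding sequences.

    x: (n, d) array of per-frame embeddings for video 1.
    y: (m, d) array of per-frame embeddings for video 2.
    Returns the cumulative alignment cost and the warping path
    as a list of (frame_in_x, frame_in_y) index pairs.
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between frames of the two videos.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated-cost table with an extra padding row/column of inf.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrack the optimal warping path from the bottom-right cell.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]
```

Two identical videos align frame-to-frame along the diagonal with zero cost; in a self-supervised setting, a differentiable relaxation of such an alignment cost could serve as a training signal.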
We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework.
In this work we explore a new approach for robots to teach themselves about the world simply by observing it.
We present a Deep Cuboid Detector which takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects).