We present C$\cdot$ASE, an efficient and effective framework that learns conditional Adversarial Skill Embeddings for physics-based characters.
However, leveraging 3D scene representation can be prohibitively unpractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost.
It is essential yet challenging for future home-assistant robots to understand and manipulate diverse 3D objects in daily human environments.
Previous approaches either choose the frontier as the goal position via a myopic solution that hinders the time efficiency, or maximize the long-term value via reinforcement learning to directly regress the goal position, but does not guarantee the complete map construction.
We describe a method to deal with performance drop in semantic segmentation caused by viewpoint changes within multi-camera systems, where temporally paired images are readily available, but the annotations may only be abundant for a few typical views.
Part assembly is a typical but challenging task in robotics, where robots assemble a set of individual parts into a complete shape.
Perceiving and interacting with 3D articulated objects, such as cabinets, doors, and faucets, pose particular challenges for future home-assistant robots performing daily tasks in human environments.
Furthermore, to resolve ambiguities in converting the semantic images to semantic labels, we treat the view transformation network as a functional representation of an unknown mapping implied by the color images and propose functional label hallucination to generate pseudo-labels in the target domain.
Another approach is to concatenate all the modalities into a tuple and then contrast positive and negative tuple correspondences.
Ranked #60 on Semantic Segmentation on NYU Depth v2
In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point that the perception system outputs more actionable guidance than kinematic structure estimation, by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals.
To this end, we create a large-scale dataset with these three components annotated by artists in a human-in-the-loop manner.
For the first time, we propose a unified framework that can handle 9DoF pose tracking for novel rigid object instances as well as per-part pose tracking for articulated objects from known categories.
Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks.
These approaches localize the camera in the discrete pose space and are agnostic to the localization-driven scene property, which restricts the camera pose accuracy in the coarse scale.
Localizing the camera in a known indoor environment is a key building block for scene mapping, robot navigation, AR, etc.
The former aims to recover the surface of point cloud through implicit function, while the latter encourages evenly-distributed points.
Analogous to buying an IKEA furniture, given a set of 3D parts that can assemble a single shape, an intelligent agent needs to perceive the 3D part geometry, reason to propose pose estimations for the input parts, and finally call robotic planning and control routines for actuation.
Reflections are very common phenomena in our daily photography, which distract people's attention from the scene behind the glass.
The task of classifying X-ray data is a problem of both theoretical and clinical interest.
To overcome this limitation, we propose a new decoupled learning algorithm to learn from the operator parameters to dynamically adjust the weights of a deep network for image operators, denoted as the base network.
Image dehazing aims to recover the uncorrupted content from a hazy image.
Ranked #1 on Rain Removal on DID-MDN
Image smoothing represents a fundamental component of many disparate computer vision and graphics applications.
Many different deep networks have been used to approximate, accelerate or improve traditional image operators, such as image smoothing, super-resolution and denoising.
Removing reflection artefacts from a single image is a problem of both theoretical and practical interest, which still presents challenges because of the massively ill-posed nature of the problem.
This paper proposes a deep neural network structure that exploits edge information in addressing representative low-level vision tasks such as layer separation and image filtering.
While invaluable for many computer vision applications, decomposing a natural image into intrinsic reflectance and shading layers represents a challenging, underdetermined inverse problem.