We address the problem of learning representations from observations of a scene involving an agent and an external object the agent interacts with.
Optimization based on Contrastive Language-Image Pre-training (CLIP) is then used to guide the latent representation of a fashion image in the direction of a target attribute expressed as a text prompt.
Data-driven and controllable human motion synthesis and prediction are active research areas with various applications in interactive media and social robotics.
We introduce Equivariant Isomorphic Networks (EquIN) -- a method for learning representations that are equivariant with respect to general group actions over data.
We study the problem of learning graph dynamics of deformable objects that generalizes to unknown physical properties.
We present CycleDance, a dance style transfer system that transforms an existing motion clip in one dance style into a motion clip in another dance style while attempting to preserve the motion context of the dance.
Structural node embeddings, vectors capturing local connectivity information for each node in a graph, have many applications in data mining and machine learning, e.g., network alignment, node classification, clustering, and anomaly detection.
However, a major challenge is the distributional shift between the states in the training dataset and those visited by the learned policy at test time.
In this work we provide an analysis of the distribution of the post-adaptation parameters of Gradient-Based Meta-Learning (GBML) methods.
We introduce a general method for learning representations that are equivariant to symmetries of data.
We introduce an algorithm for active function approximation based on nearest neighbor regression.
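The abstract gives no algorithmic detail, but the general idea of active function approximation with nearest-neighbor regression can be illustrated with a minimal sketch: predict with the closest labeled point, and actively query the pool point farthest from all labeled points. The function names (`nn_predict`, `select_next_query`) and the farthest-point heuristic are illustrative assumptions, not the authors' method.

```python
import numpy as np

def nn_predict(X_train, y_train, X_query):
    """1-nearest-neighbor regression: return the label of the closest training point."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    return y_train[np.argmin(d, axis=1)]

def select_next_query(X_train, X_pool):
    """Active selection heuristic (assumed here): query the pool point
    farthest from every labeled point, i.e. where the model is least informed."""
    d = np.linalg.norm(X_pool[:, None, :] - X_train[None, :, :], axis=-1)
    return int(np.argmax(d.min(axis=1)))

# Toy usage: actively approximate f(x) = sin(x) on [0, 2*pi]
rng = np.random.default_rng(0)
X_pool = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
f = lambda x: np.sin(x).ravel()

X_train, y_train = X_pool[:2], f(X_pool[:2])
for _ in range(20):
    i = select_next_query(X_train, X_pool)
    X_train = np.vstack([X_train, X_pool[i]])
    y_train = np.append(y_train, f(X_pool[i:i + 1]))
```

With 22 queries spread by the farthest-point rule, the 1-NN predictor covers the interval fairly evenly, which is the core benefit active selection is meant to provide over random sampling.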
We present a data-efficient framework for solving sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models.
Advanced representation learning techniques require reliable and general evaluation methods.
We argue that when comparing two graphs, the distribution of node structural features is more informative than global graph statistics which are often used in practice, especially to evaluate graph generative models.
Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels.
Most methods learn state representations by utilizing losses based on the reconstruction of the raw observations from a lower-dimensional latent space.
The state-of-the-art unsupervised contrastive visual representation learning methods that have emerged recently (SimCLR, MoCo, SwAV) all use data augmentations to construct a pretext task of instance discrimination consisting of similar and dissimilar pairs of images.
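The instance-discrimination pretext task is typically trained with the NT-Xent (normalized temperature-scaled cross-entropy) loss popularized by SimCLR: two augmented views of the same image form a positive pair, and all other images in the batch serve as negatives. A minimal NumPy sketch (the batching and temperature value here are illustrative assumptions):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for instance discrimination. z1[i] and z2[i] are
    embeddings of two augmented views of image i (a positive pair);
    every other embedding in the batch acts as a negative."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via L2-normalization
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # view i pairs with view i+n
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

Minimizing this loss pulls the two augmented views of each image together while pushing apart embeddings of different images, which is exactly the similar/dissimilar pair structure the pretext task defines.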
Evaluating the quality of learned representations without relying on a downstream task remains one of the challenges in representation learning.
Data-driven approaches for modeling human skeletal motion have found various applications in interactive media and social robotics.
Few-shot meta-learning methods aim to learn the common structure shared across a set of tasks to facilitate learning new tasks with small amounts of data.
Our results show that the proposed method can successfully adapt a trained policy to robotic platforms with novel physical parameters, and demonstrate the superiority of our meta-learning algorithm over state-of-the-art methods on the introduced few-shot policy adaptation problem.
In an ablation study, we show the benefits of the two-stage model for single time step prediction and the effectiveness of the mixed-horizon model for long-term prediction tasks.
We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects.
In this work, we address the interpretability of NN-based models by introducing the kinodynamic images.
Deformable objects present a formidable challenge for robotic manipulation due to the lack of canonical low-dimensional representations and the difficulty of capturing, predicting, and controlling such objects.
We present a Witness Autoencoder (W-AE) – an autoencoder that captures geodesic distances of the data in the latent space.
We present a data-efficient framework for solving visuomotor sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models.
We present a reinforcement learning based framework for human-centered collaborative systems.
Our second contribution is a unifying mathematical formulation for learning latent relations.
To further investigate this matter, we analyze a discrete-time linear autonomous system, and show theoretically how it relates to a model with a single ReLU and how common properties can give rise to the dying ReLU problem.
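The dying ReLU phenomenon mentioned here is easy to demonstrate in isolation: once a unit's pre-activation is negative for every input in the data range, its output and its gradient are both identically zero, so gradient descent can never revive it. A minimal sketch with an assumed toy configuration (the weight and bias values are illustrative, not taken from the analysis):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Single ReLU unit y = relu(w*x + b) over inputs x in [-1, 1].
# With b pushed far enough negative, the pre-activation is negative
# everywhere, so the unit is "dead": zero output AND zero gradient.
x = np.linspace(-1.0, 1.0, 100)
w, b = 1.0, -2.0                      # hypothetical dead configuration
pre = w * x + b                       # pre-activation, in [-3, -1]
y = relu(pre)                         # output: all zeros
grad_w = np.where(pre > 0, x, 0.0)    # dy/dw: zero wherever pre <= 0
grad_b = np.where(pre > 0, 1.0, 0.0)  # dy/db: likewise zero
```

Because both gradients vanish on the whole input range, no gradient step can move `w` or `b`, which is the mechanism the linear-system analysis connects to.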
Research on automated, image based identification of clothing categories and fashion landmarks has recently gained significant interest due to its potential impact on areas such as robotic clothing manipulation, automated clothes sorting and recycling, and online shopping.
We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces such as manipulation of deformable objects.
The purpose of this benchmark is to evaluate the planning and control aspects of robotic in-hand manipulation systems.
Recent findings show that deep generative models can judge out-of-distribution samples as more likely than those drawn from the same distribution as the training data.
Coordinating actions with an interaction partner requires a constant exchange of sensorimotor signals.
Learning dynamics models is an essential component of model-based reinforcement learning.
In socially assistive robotics, an important research area is the development of adaptation techniques and their effect on human-robot interaction.
We propose a model and architecture for a sequential variational autoencoder that embeds the space of simulated trajectories into a lower-dimensional space of latent paths in an unsupervised way.
The qualitative experiments show results of pose and shape estimation of objects held by a hand "in the wild".
Our further contribution is a neural network architecture and training pipeline that use experience from grasping objects in simulation to learn grasp stability scores.
Therefore, video-based human activity modeling is concerned with a number of tasks such as inferring current and future semantic labels, predicting future continuous observations as well as imagining possible future label and feature sequences.
In this work we introduce semi-supervised variational recurrent neural networks which are able to a) model temporal distributions over latent factors and the observable feature space, b) incorporate discrete labels such as activity type when available, and c) generate possible future action sequences on both feature and label level.
Moving a human body or a large and bulky object can require the strength of whole arm manipulation (WAM).
The low-dimensional space and master policy found by our method enable policies to quickly adapt to new environments.
Usually, this is achieved by precisely modeling physical properties of the objects, robot, and the environment for explicit planning.
Fluent and safe interactions of humans and robots require both partners to anticipate the others' actions.
To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units.
Modeling of physical human-robot collaborations is generally a challenging problem due to the unpredictable nature of human behavior.
In this paper, we provide an extensive evaluation of the performance of local descriptors for tracking applications.
Recent approaches in robotics follow the insight that perception is facilitated by interaction with the environment.
We present a novel method for learning densities with bounded support which enables us to incorporate 'hard' topological constraints.