In our method, a policy conditioned on a continuous or discrete latent variable is trained by directly maximizing a variational lower bound of the mutual information, rather than using the mutual information as an unsupervised reward, as in previous studies.
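For concreteness, one standard bound of this kind is the Barber-Agakov variational lower bound, sketched here in illustrative notation (the precise variables between which the mutual information is measured depend on the method): with latent variable Z, state S visited under the latent-conditioned policy, and a learned variational posterior q_\phi,

    I(Z; S) = \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \geq \mathcal{H}(Z) + \mathbb{E}_{p(z)\,p(s \mid z)}\!\left[ \log q_\phi(z \mid s) \right].

The right-hand side can then be maximized jointly over the policy parameters and q_\phi, rather than feeding \log q_\phi(z \mid s) back to the policy as an unsupervised reward.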
The experimental results indicate that the solution manifold can be learned with the proposed algorithm, and the trained model represents an infinite set of homotopic solutions for motion-planning problems.
Model-based meta-reinforcement learning (RL) methods have recently been shown to be a promising approach to improving the sample efficiency of RL in multi-task settings.
The learned decoder can be used as a motion planner in which the user specifies the goal position and the trajectory type by setting the latent variables.
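For illustration, a minimal sketch of how such a decoder could be queried as a planner is given below, assuming a goal-and-latent-conditioned network that emits a fixed-horizon sequence of waypoints; the class name, dimensions, and interface are hypothetical placeholders, not the actual architecture.

    import torch
    import torch.nn as nn

    class TrajectoryDecoder(nn.Module):
        """Hypothetical decoder: maps (goal, latent) to a waypoint sequence."""
        def __init__(self, goal_dim=2, latent_dim=1, horizon=50):
            super().__init__()
            self.horizon, self.goal_dim = horizon, goal_dim
            self.net = nn.Sequential(
                nn.Linear(goal_dim + latent_dim, 128),
                nn.ReLU(),
                nn.Linear(128, horizon * goal_dim),
            )

        def forward(self, goal, z):
            out = self.net(torch.cat([goal, z], dim=-1))
            return out.view(-1, self.horizon, self.goal_dim)  # (batch, T, dim)

    decoder = TrajectoryDecoder()        # in practice, load trained weights here
    goal = torch.tensor([[0.5, -0.2]])   # user-specified goal position
    z = torch.tensor([[0.8]])            # latent value selecting the trajectory type
    with torch.no_grad():
        waypoints = decoder(goal, z)     # one planned trajectory, shape (1, 50, 2)

Sweeping z while holding the goal fixed would then trace out the different homotopic solutions the model has captured.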
Finally, we investigate the application of multi-agent methods to high-dimensional robotic tasks and show that our approach can be used to learn decentralized policies in this domain.
However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task.
This process of learning from demonstrations, and the study of algorithms to do so, is called imitation learning.
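As a minimal concrete instance, behavioral cloning, the simplest imitation-learning algorithm, reduces the problem to supervised regression from demonstrated states to expert actions; the data and network below are illustrative placeholders.

    import torch
    import torch.nn as nn

    # Placeholder demonstrations: (state, expert action) pairs.
    states = torch.randn(1000, 4)
    expert_actions = torch.randn(1000, 2)

    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Behavioral cloning: regress the policy's actions onto the expert's.
    for _ in range(200):
        loss = nn.functional.mse_loss(policy(states), expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()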
Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL).