A General Theory of Relativity in Reinforcement Learning

29 Sep 2021 · Lei Han, Cheng Zhou, Yizheng Zhang

We propose a new general theory that measures the relativity between two arbitrary Markov Decision Processes (MDPs) from the perspective of reinforcement learning (RL). Given two MDPs, tasks such as policy transfer, dynamics modeling, environment design, and simulation-to-reality transfer (sim2real) are all closely related. The proposed theory investigates the connection between any two cumulative expected returns defined on different policies and environment dynamics, and the theoretical results suggest two new general algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modeling, respectively. RPO updates the policy using the relative policy gradient to transfer a policy evaluated in one environment so that it maximizes the return in another, while RTO updates the parameterized dynamics model (if one exists) using the relative transition gradient to reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete algorithm, Relative Policy-Transition Optimization (RPTO), in which the policy interacts with the two environments simultaneously, so that data collection from both environments and the policy and transition updates are completed in a closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPO, RTO, and RPTO on OpenAI Gym's classic control tasks by creating policy transfer problems.
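
Since only the abstract is available here, the exact forms of the relative policy gradient and relative transition gradient are not reproduced. The sketch below is therefore only an illustration of RPTO's closed-loop structure under stated assumptions: PyTorch and Gym (with the ≥ 0.26 reset/step API) are available, the source/target mismatch is created by a hypothetical change to CartPole's pole length, and ordinary REINFORCE plus an MSE dynamics loss stand in for the paper's relative gradients.

```python
# Illustrative sketch of the RPTO closed loop (not the authors' code).
import gym
import numpy as np
import torch
import torch.nn as nn

# Policy over CartPole's 4-dim state / 2 discrete actions, plus a
# parameterized transition model s,a -> s' (the object an RTO-style
# step would adjust).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
dyn_model = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 4))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
dyn_opt = torch.optim.Adam(dyn_model.parameters(), lr=1e-3)

source = gym.make("CartPole-v1")   # stands in for the simulator
target = gym.make("CartPole-v1")   # stands in for the "real" environment
target.unwrapped.length = 1.0      # hypothetical dynamics mismatch (longer pole)

def rollout(env):
    """One episode of (state, action, reward, next_state) tuples."""
    s, _ = env.reset()
    traj, done = [], False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        s2, r, terminated, truncated, _ = env.step(a)
        traj.append((s, a, r, s2))
        s, done = s2, terminated or truncated
    return traj

def to_tensors(traj):
    """Stack a trajectory into state/action/reward-to-go/next-state tensors."""
    s = torch.as_tensor(np.array([t[0] for t in traj]), dtype=torch.float32)
    a = torch.as_tensor([t[1] for t in traj])
    g = torch.as_tensor([sum(t[2] for t in traj[i:]) for i in range(len(traj))],
                        dtype=torch.float32)   # undiscounted reward-to-go
    s2 = torch.as_tensor(np.array([t[3] for t in traj]), dtype=torch.float32)
    return s, a, g, s2

for it in range(200):
    # Closed loop: the policy interacts with both environments every iteration.
    src, tgt = rollout(source), rollout(target)

    # RTO-style step (stand-in): fit the parameterized dynamics model to
    # target-environment transitions, shrinking the dynamics gap.
    s, a, _, s2 = to_tensors(tgt)
    pred = dyn_model(torch.cat([s, a.float().unsqueeze(1)], dim=1))
    dyn_loss = ((pred - s2) ** 2).mean()
    dyn_opt.zero_grad()
    dyn_loss.backward()
    dyn_opt.step()

    # RPO-style step (stand-in): plain REINFORCE on source rollouts; the
    # paper's relative policy gradient would instead couple this update to
    # the return achieved in the target environment.
    s, a, g, _ = to_tensors(src)
    logp = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    pi_loss = -(logp * (g - g.mean())).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```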
