Plan Your Target and Learn Your Skills: State-Only Imitation Learning via Decoupled Policy Optimization

State-only imitation learning (SOIL) enables agents to learn from massive demonstrations without explicit action or reward information. However, previous methods attempt to learn the implicit state-to-action mapping policy directly from state-only data, which results in ambiguity and inefficiency. In this paper, we overcome this issue by introducing the hyper-policy, the set of policies that share the same state transition, to characterize optimality in SOIL. Accordingly, we propose Decoupled Policy Optimization (DPO), which explicitly decouples the state-to-action mapping policy into a state transition predictor and an inverse dynamics model. Intuitively, we teach the agent to plan the target state to go to, and then to learn its own skills to reach it. Experiments on standard benchmarks and a real-world driving dataset demonstrate the effectiveness of DPO and its potential for bridging the gap between reality and simulation in reinforcement learning.
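To make the decoupling concrete, below is a minimal PyTorch sketch of the two-module structure described in the abstract: a state transition predictor ("plan the target") composed with an inverse dynamics model ("learn the skill"). The class names, network sizes, and deterministic outputs are illustrative assumptions, not the paper's actual architecture or training objective.

```python
import torch
import torch.nn as nn


class StatePredictor(nn.Module):
    """Predicts a target next state s' from the current state s (the 'plan')."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state):
        return self.net(state)


class InverseDynamicsModel(nn.Module):
    """Infers the action that moves the agent from s to s' (the 'skill')."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, next_state):
        return self.net(torch.cat([state, next_state], dim=-1))


class DecoupledPolicy(nn.Module):
    """Composes the two modules into a state-to-action policy."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.predictor = StatePredictor(state_dim)
        self.inverse_dynamics = InverseDynamicsModel(state_dim, action_dim)

    def forward(self, state):
        target_state = self.predictor(state)                  # plan the target state
        action = self.inverse_dynamics(state, target_state)   # choose an action to reach it
        return action


# Usage: query the composed policy on a batch of states.
policy = DecoupledPolicy(state_dim=4, action_dim=2)
actions = policy(torch.randn(8, 4))
print(actions.shape)  # torch.Size([8, 2])
```

In this sketch only the predictor would need to be fit to the state-only demonstrations, while the inverse dynamics model could be learned from the agent's own interaction data; the actual losses and training procedure are those of the paper and are not reproduced here.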
