no code implementations • 16 Aug 2021 • Zhao-Hua Li, Yang Yu, Yingfeng Chen, Ke Chen, Zhipeng Hu, Changjie Fan
The empirical results show that the proposed method can preserve a higher cumulative reward than behavior cloning and learn a more consistent policy to the original one.