Particle Based Stochastic Policy Optimization

29 Sep 2021 · Qiwei Ye, Yuxuan Song, Chang Liu, Fangyun Wei, Tao Qin, Tie-Yan Liu

Stochastic policies have been widely applied for their good properties in exploration and uncertainty quantification. Modeling the policy distribution by a joint state-action distribution within the exponential family has enabled flexibility in exploration and in learning multi-modal policies, and has also brought in the probabilistic perspective of deep reinforcement learning (RL). The connection between probabilistic inference and RL makes it possible to leverage advances in probabilistic optimization tools. However, recent efforts are limited to the minimization of the reverse KL divergence, which is confidence-seeking and may diminish the merit of a stochastic policy. To leverage the full potential of stochastic policies and provide more flexible properties, there is a strong motivation to consider different update rules during policy optimization. In this paper, we propose a particle-based probabilistic policy optimization framework, ParPI, which enables the use of a broad family of divergences or distances, such as f-divergences and the Wasserstein distance, which could better serve the probabilistic behavior of the learned stochastic policy. Experiments in both online and offline settings demonstrate the effectiveness of the proposed algorithm as well as the characteristics of different discrepancy measures for policy optimization.
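The abstract's remark about the reverse KL divergence can be illustrated with a toy example. The sketch below is not ParPI and is not taken from the paper: it assumes a discretized one-dimensional action axis, a bimodal target distribution, and a brute-force search over unimodal Gaussians. All names in it (`gaussian`, `kl`, `best_fit`, the grid, the mode locations) are hypothetical choices made for illustration.

```python
import numpy as np

# Discretized 1-D "action" axis (an illustrative assumption, not from the paper).
x = np.linspace(-6.0, 6.0, 601)


def normalize(w):
    """Scale non-negative weights so they sum to one."""
    return w / w.sum()


def gaussian(mean, std):
    """Discretized Gaussian on the shared grid x."""
    return normalize(np.exp(-0.5 * ((x - mean) / std) ** 2))


def kl(a, b, eps=1e-12):
    """KL(a || b) for two discrete distributions on the shared grid."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))


# Bimodal target: equal mixture of two narrow Gaussians at -2 and +2.
target = normalize(gaussian(-2.0, 0.5) + gaussian(2.0, 0.5))


def best_fit(objective):
    """Brute-force search over a unimodal Gaussian family (mean, std)."""
    candidates = [
        (m, s)
        for m in np.linspace(-3.0, 3.0, 61)
        for s in np.linspace(0.3, 3.0, 28)
    ]
    return min(candidates, key=lambda ms: objective(gaussian(*ms)))


# Reverse KL: KL(q || p), with the fitted distribution q in the first slot.
reverse_mean, reverse_std = best_fit(lambda q: kl(q, target))
# Forward KL: KL(p || q), with the target p in the first slot.
forward_mean, forward_std = best_fit(lambda q: kl(target, q))

print("reverse KL fit:", reverse_mean, reverse_std)
print("forward KL fit:", forward_mean, forward_std)
```

In this toy setup the reverse-KL fit should settle on one of the two modes with a small spread, while the forward-KL fit should sit near the middle with a much larger spread. That difference in behavior is the kind of motivation the abstract gives for considering other discrepancy measures.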



Results from the Paper

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
MuJoCo Games | Ant-v3 | ParPI | Average Reward | 5142 | # 1
MuJoCo Games | HalfCheetah-v3 | ParPI | Average Reward | 11738 | # 1
MuJoCo Games | Hopper-v3 | ParPI | Average Reward | 3042 | # 1
MuJoCo Games | Humanoid-v3 | ParPI | Average Reward | 4912 | # 1
Offline RL | Walker2d | ParPI | D4RL Normalized Score | 151.4 | # 1
MuJoCo Games | Walker2d-v3 | ParPI | Average Reward | 5201 | # 1

