Diffusing Policies : Towards Wasserstein Policy Gradient Flows

ICLR 2018  ·  Pierre H. Richemond, Brendan Maginnis ·

Policy gradients methods often achieve better performance when the change in policy is limited to a small Kullback-Leibler divergence. We derive policy gradients where the change in policy is limited to a small Wasserstein distance (or trust region)... This is done in the discrete and continuous multi-armed bandit settings with entropy regularisation. We show that in the small steps limit with respect to the Wasserstein distance $W_2$, policy dynamics are governed by the heat equation, following the Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and advection, concentrating near actions with high reward. This helps elucidate the nature of convergence in the probability matching setup, and provides justification for empirical practices such as Gaussian policy priors and additive gradient noise. read more

PDF Abstract
No code implementations yet. Submit your code now

Tasks


Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here