A short variational proof of equivalence between policy gradients and soft Q learning

22 Dec 2017  ·  Pierre H. Richemond, Brendan Maginnis ·

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation on one part, and an entropic regularization on the other. We relate this result to the well-known convex duality of Shannon entropy and the softmax function... Such a result is also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas of convex analysis to prove a new policy inequality relative to soft Q-learning. read more

PDF Abstract
No code implementations yet. Submit your code now


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.