Soft Q Network

20 Dec 2019  ·  Jingbin Liu, Shuai Liu, Xinyang Gu ·

Deep Q Network (DQN) is a very successful algorithm, yet the inherent problem of reinforcement learning, i.e. the exploit-explore balance, remains. In this work, we introduce entropy regularization into DQN and propose SQN. We find that the backup equation of soft Q learning can enjoy the corrective feedback if we view the soft backup as policy improvement in the form of Q, instead of policy evaluation. We show that Soft Q Learning with Corrective Feedback (SQL-CF) underlies the on-plicy nature of SQL and the equivalence of SQL and Soft Policy Gradient (SPG). With these insights, we propose an on-policy version of deep Q learning algorithm, i.e. Q On-Policy (QOP). We experiment with QOP on a self-play environment called Google Research Football (GRF). The QOP algorithm exhibits great stability and efficiency in training GRF agents.

PDF Abstract
No code implementations yet. Submit your code now

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods