TD Learning with Constrained Gradients

ICLR 2018  ·  Ishan Durugkar, Peter Stone ·

Temporal Difference Learning with function approximation is known to be unstable. Previous work like \citet{sutton2009fast} and \citet{sutton2009convergent} has presented alternative objectives that are stable to minimize. However, in practice, TD-learning with neural networks requires various tricks like using a target network that updates slowly \citep{mnih2015human}. In this work we propose a constraint on the TD update that minimizes change to the target values. This constraint can be applied to the gradients of any TD objective, and can be easily applied to nonlinear function approximation. We validate this update by applying our technique to deep Q-learning, and training without a target network. We also show that adding this constraint on Baird's counterexample keeps Q-learning from diverging.

PDF Abstract
No code implementations yet. Submit your code now


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.