Average Reward Reinforcement Learning with Monotonic Policy Improvement

1 Jan 2021 · Yiming Zhang, Keith W. Ross

In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounting framework, as it better captures an agent's long-term behavior. We derive a novel lower bound on the difference between the average rewards of two policies, where the lower bound depends on the average divergence between the policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a trivial lower bound in the average reward setting. We develop an iterative procedure based on our lower bound which produces a sequence of monotonically improving policies for the average reward criterion. When combined with deep reinforcement learning methods, the procedure leads to scalable and efficient algorithms aimed at maximizing an agent's average reward performance. Empirically, we demonstrate the efficacy of our algorithms through a series of high-dimensional control tasks with long time horizons and show that discounting can lead to unsatisfactory performance on continuing control tasks.
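The abstract does not state the exact form of the bound or the constants involved, so the following is only a minimal tabular sketch of the general idea: evaluate the current policy under the average-reward criterion, then improve it by trading the surrogate advantage against a divergence penalty to the current policy. The penalty coefficient `C`, the use of total-variation distance, and the helper names (`avg_reward_policy_eval`, `improve_policy`) are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch: penalized policy improvement under the average-reward
# criterion for a small ergodic MDP. The exact lower bound and constants come
# from the paper; here a generic "advantage minus C * divergence^2" penalty is
# assumed for illustration.
import numpy as np

def avg_reward_policy_eval(P, R, pi):
    """Return the gain rho, bias h, and stationary distribution d of policy pi
    for an ergodic MDP with transitions P[s, a, s'] and rewards R[s, a]."""
    S, A, _ = P.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state transitions under pi
    r_pi = np.einsum('sa,sa->s', pi, R)            # expected one-step reward
    # Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    d = d / d.sum()
    rho = d @ r_pi                                  # average reward (gain)
    # Bias h solves (I - P_pi) h = r_pi - rho, centered so that d @ h = 0.
    h = np.linalg.pinv(np.eye(S) - P_pi) @ (r_pi - rho)
    return rho, h - d @ h, d

def improve_policy(P, R, pi, C=5.0):
    """One penalized improvement step: per state, pick the action distribution
    maximizing surrogate advantage minus C times squared TV distance to pi."""
    S, A, _ = P.shape
    rho, h, d = avg_reward_policy_eval(P, R, pi)
    # Average-reward advantage: A_pi(s,a) = R(s,a) - rho + E[h(s')] - h(s).
    adv = R - rho + np.einsum('sap,p->sa', P, h) - h[:, None]
    new_pi = pi.copy()
    for s in range(S):
        best, best_val = pi[s], 0.0                # keeping pi scores zero
        for a in range(A):                         # candidate deterministic rows
            cand = np.eye(A)[a]
            tv = 0.5 * np.abs(cand - pi[s]).sum()
            val = d[s] * (cand @ adv[s]) - C * d[s] * tv ** 2
            if val > best_val:
                best, best_val = cand, val
        new_pi[s] = best
    return new_pi, rho

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 6, 3
    P = rng.dirichlet(np.ones(S), size=(S, A))     # random ergodic MDP
    R = rng.random((S, A))
    pi = np.full((S, A), 1.0 / A)                  # start from the uniform policy
    for it in range(20):
        pi, rho = improve_policy(P, R, pi)
        print(f"iter {it:2d}  average reward ~ {rho:.4f}")
```

In the paper the analogous step is carried out with deep RL machinery rather than a tabular sweep, but the structure (evaluate, bound the improvement by an average divergence, update conservatively) is the same.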
