A Risk-Sensitive Policy Gradient Method

29 Sep 2021  ·  Jared Markowitz, Ryan Gardner, Ashley Llorens, Raman Arora, I-Jeng Wang

Standard deep reinforcement learning (DRL) agents aim to maximize expected reward, considering collected experiences equally in formulating a policy. This differs from human decision-making, where gains and losses are valued differently and outlying outcomes are given increased consideration. It also wastes an opportunity for the agent to modulate behavior based on distributional context. Several approaches to distributional DRL have been investigated, with one popular strategy being to evaluate the projected distribution of returns for possible actions. We propose a more direct approach, whereby the distribution of full-episode outcomes is optimized to maximize a chosen function of its cumulative distribution function (CDF). This technique allows outcomes to be weighted based on relative quality, does not require modification of the reward function to modulate agent behavior, and may be used for both continuous and discrete action spaces. We show how to achieve an unbiased estimate of the policy gradient for a broad class of CDF-based objectives via sampling, subsequently incorporating variance reduction measures to facilitate effective on-policy learning. We use the resulting approach to train agents with different “risk profiles” in penalty-based formulations of six OpenAI Safety Gym environments, finding that moderate emphasis on improvement in training scenarios where the agent performs poorly generally improves agent behavior. We interpret and explore this observation, which leads to improved performance over the widely used Proximal Policy Optimization algorithm in all environments tested.
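The idea of weighting full-episode outcomes by their position in the empirical CDF can be illustrated with a short NumPy sketch. This is a simplified illustration under stated assumptions, not the estimator derived in the paper: the function names `cdf_weights` and `weighted_pg_estimate`, the midpoint-quantile convention, the mean-return baseline, and the example weight function `2 * (1 - u)` are all hypothetical choices made here, and no claim of unbiasedness is made for this simplified form.

```python
import numpy as np


def cdf_weights(returns, weight_fn):
    """Weight each episode by its position in the empirical CDF of returns.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    # Rank 0 is the worst episode, rank n-1 the best (ties broken by order).
    ranks = np.argsort(np.argsort(returns, kind="stable"), kind="stable")
    # Midpoint quantile of each episode, strictly inside (0, 1).
    quantiles = (ranks + 0.5) / n
    w = np.asarray(weight_fn(quantiles), dtype=float)
    # Normalize so the average weight is 1 (assumed convention).
    return w / w.mean()


def weighted_pg_estimate(score_sums, returns, weight_fn):
    """Rank-weighted, REINFORCE-style gradient estimate over full episodes.

    score_sums[i] is assumed to hold sum_t grad log pi(a_t | s_t) for episode i.
    A mean-return baseline is used as a simple variance reduction choice.
    """
    score_sums = np.asarray(score_sums, dtype=float)
    returns = np.asarray(returns, dtype=float)
    w = cdf_weights(returns, weight_fn)
    advantages = returns - returns.mean()
    return np.mean((w * advantages)[:, None] * score_sums, axis=0)


# Example "risk profile" that emphasizes poorly performing episodes (assumed form).
def pessimistic(u):
    return 2.0 * (1.0 - u)


# Uniform weighting, which reduces to the ordinary expected-return case.
def uniform(u):
    return np.ones_like(u)
```

With `uniform`, every episode receives weight 1 and the estimate reduces to an ordinary mean-baseline policy gradient; with `pessimistic`, the lowest-return episodes contribute most, which mirrors the "moderate emphasis on scenarios where the agent performs poorly" described in the abstract.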
