Motivated by the demand for an effective deep reinforcement learning algorithm that accommodates sparse-reward environments, this paper presents Hindsight Trust Region Policy Optimization (HTRPO), a method that efficiently utilizes interactions in sparse-reward conditions to optimize policies within a trust region while maintaining learning stability.
This article develops a deep reinforcement learning (Deep-RL) framework for dynamic pricing on managed lanes with multiple access locations and heterogeneity in travelers' value of time, origin, and destination.
To deal with this problem, we i) introduce a cooperative game-theoretical framework called the extended convex game (ECG), which is a superset of the global reward game, and ii) propose a local reward approach called Shapley Q-value.
Recently, a variety of methods have been developed for this problem, which generally try to learn effective representations of users and items and then match items to users according to their representations.
We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions.
Recent approaches to question generation have used modifications to a Seq2Seq architecture inspired by advances in machine translation.
Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge.