This article develops a deep reinforcement learning (Deep-RL) framework for dynamic pricing on managed lanes with multiple access locations and heterogeneity in travelers' value of time, origin, and destination.
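As a concrete illustration of how such a pricing problem can be cast as an MDP, the toy environment below encodes lane densities as the state and a toll as the action; the class name TollLaneEnv, the demand model, and the reward shaping are illustrative assumptions, not the framework's actual specification.

```python
import numpy as np

class TollLaneEnv:
    """Toy managed-lane pricing MDP (illustrative only).

    State: densities of the managed and general-purpose lanes.
    Action: a toll in [0, max_toll] for a single access point.
    Reward: revenue collected minus a congestion penalty.
    """

    def __init__(self, max_toll=10.0, capacity=1.0, seed=0):
        self.max_toll = max_toll
        self.capacity = capacity
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.state = self.rng.uniform(0.2, 0.8, size=2)  # [managed, general]
        return self.state.copy()

    def step(self, toll):
        toll = float(np.clip(toll, 0.0, self.max_toll))
        # Higher tolls deter a larger fraction of arriving travelers
        # (a placeholder for the value-of-time choice model).
        demand = self.rng.uniform(0.0, 0.3)
        entering = demand * np.exp(-toll / self.max_toll)
        self.state[0] = min(self.state[0] * 0.9 + entering, self.capacity)
        self.state[1] = min(self.state[1] * 0.9 + (demand - entering), self.capacity)
        revenue = toll * entering
        congestion = self.state[0] ** 2  # penalize managed-lane crowding
        return self.state.copy(), revenue - congestion, False, {}
```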
Based on this framework, we propose a local reward approach, Shapley Q-value, that distributes the cumulative global reward fairly, reflecting each agent's own contribution, in contrast to shared reward approaches.
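At the core of Shapley-based credit assignment is the average marginal contribution of each agent, which can be estimated by sampling permutations of the agents. In the sketch below, coalition_value is a stand-in for a learned coalition Q-function, so this illustrates only the estimator, not the full method.

```python
import numpy as np

def monte_carlo_shapley(agents, coalition_value, n_samples=1000, rng=None):
    """Estimate each agent's Shapley value by sampling agent orderings.

    coalition_value: callable mapping a frozenset of agent ids to a scalar
    (in the method this would be a learned coalition Q-function).
    """
    rng = rng or np.random.default_rng(0)
    phi = {a: 0.0 for a in agents}
    for _ in range(n_samples):
        order = rng.permutation(agents)
        coalition = frozenset()
        for a in order:
            with_a = coalition | {a}
            # Marginal contribution of agent a to the growing coalition.
            phi[a] += coalition_value(with_a) - coalition_value(coalition)
            coalition = with_a
    return {a: v / n_samples for a, v in phi.items()}

# Example: a superadditive global reward, where joint presence pays off.
vals = monte_carlo_shapley([0, 1, 2], lambda s: len(s) ** 2)
```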
To accelerate the learning of policy gradient methods, we describe a novel off-policy learning framework and establish the equivalence between maximizing a lower bound of the return and imitating a near-optimal policy, without access to any oracle.
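To make the imitation side of this equivalence concrete, the sketch below performs supervised updates only on trajectories whose return exceeds a threshold, a self-imitation-style stand-in for the paper's construction; the threshold filter and the policy interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def imitation_update(policy, optimizer, trajectories, return_threshold):
    """Imitate high-return behavior with a supervised loss (illustration).

    trajectories: list of (states, actions, ret), where states is a
    [T, obs_dim] float tensor and actions is a [T] long tensor.
    """
    optimizer.zero_grad()
    losses = []
    for states, actions, ret in trajectories:
        if ret < return_threshold:
            continue  # only imitate (approximately) near-optimal trajectories
        logits = policy(states)  # [T, n_actions]
        losses.append(F.cross_entropy(logits, actions))
    if losses:
        loss = torch.stack(losses).mean()
        loss.backward()
        optimizer.step()
```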
Recently, a variety of methods have been developed for this problem; they generally learn effective representations of users and items and then match items to users according to these representations.
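In the simplest form of this matching step, users and items live in a shared embedding space and items are ranked by inner product; the function names and dimensions below are illustrative, assuming the representations have already been learned.

```python
import numpy as np

def score_items(user_vec, item_matrix):
    """Score all items for one user by inner product of representations."""
    # item_matrix: [n_items, dim], user_vec: [dim]
    return item_matrix @ user_vec

def top_k(user_vec, item_matrix, k=10):
    """Return the indices of the k best-matching items."""
    scores = score_items(user_vec, item_matrix)
    return np.argsort(-scores)[:k]
```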
Recent approaches to question generation have used modifications to a Seq2Seq architecture inspired by advances in machine translation.
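For context, a bare-bones version of such an encoder-decoder is sketched below in PyTorch; actual question-generation models add attention and copy mechanisms on top of this skeleton, and all layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class Seq2SeqQG(nn.Module):
    """Minimal encoder-decoder of the kind adapted from machine
    translation for question generation (no attention; illustrative)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, passage_ids, question_ids):
        # Encode the passage; its final hidden state seeds the decoder.
        _, h = self.encoder(self.embed(passage_ids))
        dec_out, _ = self.decoder(self.embed(question_ids), h)
        return self.out(dec_out)  # [batch, tgt_len, vocab_size] logits
```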
Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge.
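One common, though here purely hypothetical, starting point is to fix a small discrete set of dialogue acts as the action space and let a value-based policy choose among them at each turn, as sketched below.

```python
import random

# A hypothetical discrete action space of dialogue acts.
DIALOGUE_ACTS = ["greet", "ask_clarify", "inform", "recommend", "farewell"]

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a dialogue act index from learned Q-values (illustration)."""
    if random.random() < epsilon:
        return random.randrange(len(DIALOGUE_ACTS))  # explore
    return max(range(len(DIALOGUE_ACTS)), key=lambda a: q_values[a])
```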
The main idea is to use existing trajectories sampled by the policy gradient method to optimize a one-step improvement objective, yielding a sample-efficient and computationally efficient algorithm that is easy to implement.
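One standard instantiation of a one-step improvement objective is the conservative-policy-iteration surrogate evaluated on trajectories from the previous policy; the sketch below assumes a policy object with a log_prob helper and may differ from the paper's exact objective.

```python
import torch

def one_step_surrogate(policy, states, actions, old_logp, advantages):
    """Importance-weighted surrogate on previously sampled trajectories:

        maximize E[ (pi_theta(a|s) / pi_old(a|s)) * A_old(s, a) ]
    """
    logp = policy.log_prob(states, actions)  # assumed helper on the policy
    ratio = torch.exp(logp - old_logp)       # pi_theta / pi_old
    return (ratio * advantages).mean()

# One gradient-ascent step on the surrogate:
# loss = -one_step_surrogate(policy, s, a, old_logp, adv)
# loss.backward(); optimizer.step()
```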
Building upon the recent success of deep reinforcement learning methods, we investigate whether on-policy reinforcement learning can be improved by reusing data from several consecutive policies.
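A hedged sketch of one way to realize this: keep transitions from the last K policies together with their behavior log-probabilities, and reweight them with clipped importance ratios. The class name, the log_prob helper, and the clipping constant are assumptions for illustration, not the paper's actual correction.

```python
from collections import deque
import torch

class RecentPolicyBuffer:
    """Store transitions from the last k policy iterations and reweight
    them with clipped importance ratios (illustration only)."""

    def __init__(self, k=4):
        self.batches = deque(maxlen=k)

    def add(self, states, actions, behavior_logp):
        self.batches.append((states, actions, behavior_logp))

    def weighted_logp(self, policy, clip=10.0):
        """Importance-weighted log-likelihood under the current policy."""
        terms = []
        for states, actions, behavior_logp in self.batches:
            logp = policy.log_prob(states, actions)  # assumed helper
            # Clip the ratio to bound the variance of stale batches.
            w = torch.exp(logp - behavior_logp).clamp(max=clip).detach()
            terms.append(w * logp)
        return torch.cat(terms).mean()
```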
We find that adaptive optimizers have a narrow window of effective learning rates, outside of which they diverge, and that the effectiveness of momentum varies with the properties of the environment.
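This sensitivity can be probed even on a toy problem. The sweep below minimizes f(x) = 0.5 x^2 with heavy-ball momentum and a hand-rolled Adam across learning rates; the quadratic objective and the specific rates are illustrative stand-ins for the RL environments studied.

```python
import numpy as np

def run(opt, lr, steps=200, beta=0.9):
    """Minimize f(x) = 0.5 * x^2; return final |x| (inf if diverged)."""
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = x  # gradient of f at x
        if opt == "momentum":
            m = beta * m + g
            x -= lr * m
        else:  # minimal Adam with bias correction
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g * g
            mhat = m / (1 - 0.9 ** t)
            vhat = v / (1 - 0.999 ** t)
            x -= lr * mhat / (np.sqrt(vhat) + 1e-8)
        if not np.isfinite(x) or abs(x) > 1e6:
            return float("inf")
    return abs(x)

# Sweep learning rates: too small barely moves, too large oscillates
# or diverges, exposing each optimizer's effective window.
for lr in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    print(lr, run("adam", lr), run("momentum", lr))
```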