This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting.
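Concretely, the distributional view replaces a scalar critic with one that outputs a full return distribution. Below is a minimal sketch of a categorical distributional critic over continuous state-action inputs; the support bounds V_MIN/V_MAX, the atom count, and the layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51  # assumed support of the return

class DistributionalCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_ATOMS),
        )
        # Fixed atoms over which the return distribution is defined.
        self.register_buffer("atoms", torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, state, action):
        logits = self.net(torch.cat([state, action], dim=-1))
        probs = torch.softmax(logits, dim=-1)   # P(Z = z_i | s, a)
        q = (probs * self.atoms).sum(dim=-1)    # scalar Q as the mean of Z
        return probs, q

critic = DistributionalCritic(state_dim=3, action_dim=1)
probs, q = critic(torch.randn(4, 3), torch.randn(4, 1))
```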
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.
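The core of that adaptation is to replace DQN's argmax over discrete actions with a deterministic actor trained by ascending the critic's action-gradient. A minimal sketch under assumed network sizes and learning rate:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)  # deterministic action in [-1, 1]

class Critic(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(3, 1), Critic(3, 1)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
s = torch.randn(32, 3)
loss = -critic(s, actor(s)).mean()  # ascend Q by descending its negation
opt.zero_grad()
loss.backward()
opt.step()
```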
We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimates.
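Schematically, such an estimator enumerates the buffered trajectories exactly, weighted by their current policy probabilities, and reweights fresh samples by the remaining probability mass. A minimal sketch of that weighting, with the buffer probabilities supplied as a placeholder array:

```python
import numpy as np

def mapo_weights(buffer_probs, n_fresh):
    """Split the gradient estimator between memory and fresh samples.

    buffer_probs: current policy probabilities of the buffered trajectories.
    n_fresh:      number of freshly sampled trajectories outside the buffer.
    """
    pi_b = float(np.sum(buffer_probs))        # total mass inside the memory
    w_mem = np.asarray(buffer_probs)          # buffer is enumerated exactly
    w_fresh = np.full(n_fresh, (1.0 - pi_b) / max(n_fresh, 1))
    return w_mem, w_fresh

w_mem, w_fresh = mapo_weights(buffer_probs=[0.2, 0.1], n_fresh=8)
```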
This approach makes the number of network outputs grow only linearly with the number of degrees of freedom by allowing a level of independence for each individual action dimension.
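A minimal sketch of such a branched output layer: one small head per action dimension, so the output count is n_dims * n_bins rather than n_bins ** n_dims for a joint discretization. Bin count and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    def __init__(self, s_dim, n_dims, n_bins, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        # One independent head per degree of freedom: n_dims * n_bins
        # outputs instead of n_bins ** n_dims for a joint discretization.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_bins) for _ in range(n_dims)])

    def forward(self, s):
        z = self.trunk(s)
        return torch.stack([h(z) for h in self.heads], dim=1)  # (B, n_dims, n_bins)

net = BranchingQNet(s_dim=8, n_dims=6, n_bins=11)
greedy = net(torch.randn(2, 8)).argmax(dim=-1)  # independent argmax per dimension
```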
To deal with this problem, we i) introduce a cooperative game-theoretic framework called the extended convex game (ECG), a superset of the global reward game, and ii) propose a local reward approach called the Shapley Q-value.
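The Shapley value of an agent averages its marginal contribution over orderings of the agents; applied to a coalition Q-function, a Monte Carlo estimate looks like the sketch below. The coalition value function q_coalition is a placeholder, not the paper's exact critic:

```python
import random

def shapley_q(agent, agents, q_coalition, n_samples=100):
    """Monte Carlo estimate of `agent`'s Shapley value of q_coalition."""
    total = 0.0
    for _ in range(n_samples):
        order = random.sample(agents, len(agents))   # random agent ordering
        before = frozenset(order[:order.index(agent)])
        total += q_coalition(before | {agent}) - q_coalition(before)
    return total / n_samples

agents = [0, 1, 2]
toy_q = lambda c: float(len(c)) ** 2   # toy coalition value for illustration
phi_0 = shapley_q(0, agents, toy_q)    # symmetric game: roughly 9 / 3 = 3
```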
In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function.
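As a rough sketch of the mechanism (not the paper's implementation), each layer samples a multiplicative noise vector, enriches its distribution with a simple planar flow, and scales the layer inputs with it; the variational KL terms a full treatment requires are omitted here:

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        return z + self.u * torch.tanh(z @ self.w + self.b)

class MNFLinear(nn.Module):
    """Linear layer whose inputs are scaled by flow-transformed noise."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.flow = PlanarFlow(in_dim)

    def forward(self, x):
        z = 1.0 + 0.1 * torch.randn(x.shape[-1])  # multiplicative noise ~ N(1, 0.1^2)
        z = self.flow(z)                          # enrich the multiplier's distribution
        return (x * z) @ self.weight.t() + self.bias

layer = MNFLinear(4, 2)
y = layer(torch.randn(8, 4))  # each call samples a fresh posterior draw
```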
This objective encourages the agent to maximize the expected return, as well as to achieve more diverse goals.
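One simple way to express such an objective is expected return plus an entropy bonus on the empirical distribution of achieved goals; the coefficient beta and the entropy form are illustrative assumptions:

```python
import numpy as np

def augmented_objective(returns, goal_counts, beta=0.1):
    """Expected return plus an entropy bonus on achieved goals."""
    p = np.asarray(goal_counts, dtype=float)
    p = p / p.sum()                            # empirical goal distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # higher entropy = more diverse goals
    return np.mean(returns) + beta * entropy

score = augmented_objective(returns=[1.0, 0.5, 2.0], goal_counts=[5, 3, 2])
```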
Model-based power allocation algorithms have been investigated for decades, but they require mathematical models that are analytically tractable and usually incur high computational complexity.
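Classical water-filling is a representative example of such a model-based scheme: it needs known channel gains and an analytically tractable objective. A minimal sketch with illustrative gains and power budget:

```python
import numpy as np

def water_filling(gains, total_power, noise=1.0):
    """Allocate p_i = max(0, mu - noise / g_i) subject to sum(p) = total_power."""
    inv = noise / np.asarray(gains, dtype=float)
    levels = np.sort(inv)
    for k in range(len(levels), 0, -1):        # find the active channel set
        mu = (total_power + levels[:k].sum()) / k
        if mu > levels[k - 1]:
            break
    return np.maximum(0.0, mu - inv)

p = water_filling(gains=[2.0, 1.0, 0.25], total_power=3.0)  # -> [1.75, 1.25, 0.0]
```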
Information Theory
Policy gradient methods have had great success in solving continuous control tasks, yet the stochastic nature of such problems makes deterministic value estimation difficult.
Deep reinforcement learning offers a model-free alternative to supervised deep learning and classical optimization for solving the transmit power control problem in wireless networks.
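In the model-free setting, an agent can instead learn a mapping from channel observations to a discrete set of transmit power levels, e.g. with a Q-network and epsilon-greedy selection. The state size, power levels, and network shape below are assumptions:

```python
import torch
import torch.nn as nn

POWER_LEVELS = torch.tensor([0.0, 0.25, 0.5, 1.0])  # assumed discretization (watts)

qnet = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                     nn.Linear(64, len(POWER_LEVELS)))

def select_power(state, eps=0.1):
    """Epsilon-greedy choice of a transmit power level from the Q-network."""
    if torch.rand(()) < eps:
        idx = torch.randint(len(POWER_LEVELS), ())
    else:
        with torch.no_grad():
            idx = qnet(state).argmax()
    return POWER_LEVELS[idx]

power = select_power(torch.randn(4))  # channel-state features are placeholders
```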