The goal of Q-learning is to learn a policy that tells an agent which action to take in each state; it does so by estimating an action-value function Q(s, a), the expected return of taking action a in state s, and acting greedily with respect to it.
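A minimal sketch of the tabular Q-learning update may help make this concrete. The environment interface (`env.reset`, `env.step`) and the hyperparameter values here are illustrative assumptions, not part of any particular paper:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes a Gym-style env exposing reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy TD target: bootstrap from the greedy next action.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # The learned policy is greedy with respect to Q.
    return Q, Q.argmax(axis=1)
```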
(Image credit: Playing Atari with Deep Reinforcement Learning)
We establish a new connection between value-based and policy-based reinforcement learning (RL), based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization.
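A hedged sketch of that relationship, in our own notation and assuming deterministic dynamics for readability: with entropy temperature $\tau > 0$ and discount $\gamma$, the entropy-regularized optimal value is a softmax over action values, and the optimal policy satisfies a temporal consistency along every transition.

```latex
% V^* is the entropy-regularized optimal value, \pi^* the optimal policy,
% s' the (deterministic) successor of taking action a in state s.
\begin{align}
V^*(s) &= \tau \log \sum_{a} \exp\!\big( (r(s,a) + \gamma V^*(s')) / \tau \big), \\
\pi^*(a \mid s) &= \exp\!\big( (r(s,a) + \gamma V^*(s') - V^*(s)) / \tau \big), \\
\text{hence}\quad
V^*(s) - \gamma V^*(s') &= r(s,a) - \tau \log \pi^*(a \mid s)
\quad \text{for every } (s, a).
\end{align}
```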
We present a framework, which we call Molecule Deep $Q$-Networks (MolDQN), for molecule optimization by combining domain knowledge of chemistry and state-of-the-art reinforcement learning techniques (double $Q$-learning and randomized value functions).
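As a hedged illustration of the double $Q$-learning ingredient named above (this shows the generic technique, not the MolDQN code; network and argument names are assumptions): the online network selects the next action, while a target network evaluates it, which reduces maximization bias.

```python
import torch

def double_q_target(online_net, target_net, reward, next_state, done,
                    gamma=0.99):
    """Double Q-learning target: the online net picks argmax_a,
    the target net supplies the value of that action."""
    with torch.no_grad():
        next_a = online_net(next_state).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_state).gather(1, next_a).squeeze(1)  # evaluation
        return reward + gamma * (1.0 - done) * next_q
```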
Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators that includes our consistent Bellman operator.
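For reference, our transcription of the consistent Bellman operator (notation assumed): on a self-loop transition it bootstraps from $Q(x,a)$ itself rather than from the max over next actions, which enlarges the action gap at suboptimal actions.

```latex
% P is the transition kernel, R the reward function, \gamma the discount.
(\mathcal{T}_C Q)(x, a) \;:=\; R(x, a)
  + \gamma \, \mathbb{E}_{x' \sim P(\cdot \mid x, a)}
  \Big[ \mathbb{1}_{[x' \neq x]} \max_{b} Q(x', b)
      + \mathbb{1}_{[x' = x]} \, Q(x, a) \Big].
```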
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated values and suboptimal policies.
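One widely used mitigation for this overestimation, sketched below with assumed network names, is clipped double $Q$-learning: maintain two critics and bootstrap from the smaller of their target estimates, trading a small underestimation bias for reduced overestimation.

```python
import torch

def clipped_double_q_target(target_q1, target_q2, target_actor,
                            reward, next_state, done, gamma=0.99):
    """Clipped double-Q target: bootstrap from the minimum of
    two target critics evaluated at the target actor's action."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        q1 = target_q1(next_state, next_action)
        q2 = target_q2(next_state, next_action)
        return reward + gamma * (1.0 - done) * torch.min(q1, q2)
```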
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.
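A minimal sketch of the resulting continuous-action update (module and variable names are ours): because the max over continuous actions is intractable, a deterministic actor is trained to output actions the critic scores highly, by ascending the critic's estimate of $Q(s, \mu(s))$.

```python
import torch

def deterministic_actor_step(actor, critic, actor_opt, states):
    """Deterministic policy gradient step: minimize -Q(s, mu(s)),
    i.e. push the actor's output toward high-value actions."""
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```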
Here, we propose a novel test-bed platform for reinforcement learning research from raw visual information, which employs a first-person perspective in a semi-realistic 3D world.
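A short usage sketch of the platform's Python API as we understand it (method names from the `vizdoom` package; the config path and random policy are placeholders):

```python
import itertools
import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("scenarios/basic.cfg")  # placeholder path
game.init()

# One-hot action space over the buttons defined in the config.
n_buttons = game.get_available_buttons_size()
actions = [list(a) for a in itertools.product([0, 1], repeat=n_buttons)]

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()
    frame = state.screen_buffer              # raw first-person pixels
    reward = game.make_action(random.choice(actions))
print("episode return:", game.get_total_reward())
game.close()
```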