We employ Proximal Iteration for value-function optimization in deep reinforcement learning.
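For orientation, here is a minimal PyTorch sketch of what a proximal step can look like in this setting, assuming the update theta_{k+1} = argmin_theta TD-loss(theta) + (c/2)||theta - theta_k||^2, solved approximately with a few inner gradient steps; all names (`q_net`, `sample_batch`, `c`, `inner_steps`) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# Sketch of one Proximal Iteration step for a Q-network: minimize the TD
# loss plus a proximal penalty anchoring the parameters to the previous
# iterate theta_k. Hyperparameters and helper names are assumptions.
def proximal_iteration_step(q_net, target_net, optimizer, sample_batch,
                            gamma=0.99, c=1.0, inner_steps=10):
    # theta_k: the anchor the proximal term pulls back toward.
    anchor = {name: p.detach().clone() for name, p in q_net.named_parameters()}
    for _ in range(inner_steps):
        s, a, r, s_next, done = sample_batch()
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * (1 - done) * target_net(s_next).max(1).values
        td_loss = nn.functional.mse_loss(q_sa, target)
        # Proximal penalty: keep theta close to the previous iterate theta_k.
        prox = sum(((p - anchor[name]) ** 2).sum()
                   for name, p in q_net.named_parameters())
        loss = td_loss + (c / 2) * prox
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Note that the penalty only matters because several inner steps are taken against a fixed anchor; with a single step, its gradient at theta_k is zero.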
We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning.
We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges under all three types of feedback.
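For context, the COACH-style update that E-COACH builds on treats each human feedback signal as an advantage estimate and accumulates policy gradients through an eligibility trace; the sketch below is a hedged reconstruction with illustrative names, not the paper's exact algorithm.

```python
import torch

# COACH-style update (sketch): human feedback f_t plays the role of an
# advantage, applied through an eligibility trace over grad log pi(a|s).
# Assumes `policy(state)` returns a 1-D vector of action probabilities and
# `traces` is a list of zero tensors shaped like the policy parameters.
def coach_update(policy, state, action, feedback, traces, lam=0.9, lr=0.05):
    logp = torch.log(policy(state)[action])
    policy.zero_grad()
    logp.backward()
    with torch.no_grad():
        for p, e in zip(policy.parameters(), traces):
            e.mul_(lam).add_(p.grad)   # e <- lambda * e + grad log pi(a|s)
            p.add_(lr * feedback * e)  # theta <- theta + alpha * f_t * e
```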
Because they rely on large numbers of trial-and-error interactions to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, where exploration can be prohibitively expensive.
In this work, we answer in the affirmative the question of whether a simple learning algorithm paired with good representations can solve challenging problems: we take the "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and the "challenging problems" to be continuous control tasks.
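A minimal sketch of the resulting pipeline, assuming a learned encoder `phi` that maps continuous observations to discrete abstract states and the classic Gym step API; all names are illustrative.

```python
import numpy as np
from collections import defaultdict

# Tabular Q-learning on top of a learned state abstraction: the encoder phi
# discretizes continuous observations, and ordinary Q-learning runs on the
# resulting abstract states. phi stands in for whatever abstraction learner
# is used; it is an assumption here.
def q_learning_on_abstraction(env, phi, n_actions, episodes=500,
                              alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        s, done = phi(env.reset()), False
        while not done:
            a = env.action_space.sample() if np.random.rand() < eps \
                else int(np.argmax(Q[s]))
            obs, r, done, _ = env.step(a)
            s2 = phi(obs)  # discrete abstract state, e.g. an int
            Q[s][a] += alpha * (r + gamma * (1 - done) * Q[s2].max() - Q[s][a])
            s = s2
    return Q
```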
We show that the maximum action-value with respect to a deep radial-basis value function (RBVF) can be approximated easily and accurately.
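One way to see why, sketched below under the assumption that the RBVF takes the common normalized-RBF form Q(s, a) = sum_i softmax(-beta * ||a - a_i||)_i * v_i: evaluating Q at each centroid and taking the best value approximates max_a Q(s, a) increasingly well as beta grows. Names here are illustrative.

```python
import torch

# Sketch of a radial-basis Q-function over a continuous action space:
# a softmax-weighted combination of per-centroid values v_i, with weights
# determined by the distance from the query action to each centroid a_i.
def rbvf_q(action, centroids, values, beta=10.0):
    # centroids: [N, action_dim], values: [N], action: [action_dim]
    logits = -beta * torch.norm(action - centroids, dim=1)
    weights = torch.softmax(logits, dim=0)
    return (weights * values).sum()

def approx_max_q(centroids, values, beta=10.0):
    # Evaluate Q at each centroid and take the best; as beta grows, Q at a
    # centroid approaches that centroid's value, so this approximates max_a Q.
    return max(rbvf_q(c, centroids, values, beta) for c in centroids)
```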
We consider the problem of knowledge transfer when an agent faces a series of Reinforcement Learning (RL) tasks.
In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions.
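A minimal sketch of such a model, with an illustrative architecture: one network consumes the current state together with the whole action sequence and predicts the outcome state directly, so one-step errors are never composed.

```python
import torch
import torch.nn as nn

# Multi-step model (sketch): maps (state, action sequence) directly to the
# predicted state after executing the whole sequence, rather than composing
# a one-step model k times. Layer sizes are illustrative assumptions.
class MultiStepModel(nn.Module):
    def __init__(self, state_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action_seq):
        # state: [B, state_dim]; action_seq: [B, horizon, action_dim]
        x = torch.cat([state, action_seq.flatten(start_dim=1)], dim=1)
        return self.net(x)  # predicted state after executing the sequence
```

Such a model can be fit by regressing its output against observed k-step outcomes collected from the environment.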
An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model.
When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes.
Learning a generative model is a key component of model-based reinforcement learning.
We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz.
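For intuition, a bound of this general shape follows from a geometric-series argument; the constants below are the standard ones for a transition model with Lipschitz constant K_T (in Wasserstein distance) and a K_R-Lipschitz reward, not necessarily the paper's exact statement.

```latex
% Illustrative Lipschitz value-function bound under a K_T-Lipschitz model,
% K_R-Lipschitz reward, and discount gamma with gamma * K_T < 1.
\[
  |\widehat{V}(s) - \widehat{V}(s')|
  \;\le\; K_R \sum_{t=0}^{\infty} (\gamma K_T)^t \, d(s, s')
  \;=\; \frac{K_R}{1 - \gamma K_T}\, d(s, s'),
  \qquad \gamma K_T < 1 .
\]
```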
We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning.
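The distinguishing actor update in MAC sums over all discrete actions weighted by the policy's probabilities instead of using a sampled log-likelihood estimator; a minimal sketch, with illustrative module names:

```python
import torch

# MAC-style actor loss (sketch): the policy gradient is taken through the
# full action distribution rather than a single sampled action, which can
# reduce the variance of the estimator. `actor` and `critic` are assumed
# to output [B, n_actions] tensors.
def mac_actor_loss(actor, critic, states):
    probs = actor(states)               # [B, n_actions], softmax output
    q_values = critic(states).detach()  # [B, n_actions], critic held fixed here
    # Expectation over the whole action distribution, not a sampled action.
    return -(probs * q_values).sum(dim=1).mean()
```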
End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors.
Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL).
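A minimal sketch of such a policy, with illustrative layer sizes: an LSTM consumes per-turn features, carries a latent dialog state, and emits a distribution over system actions that can be trained with cross-entropy (SL) or policy gradient (RL).

```python
import torch
import torch.nn as nn

# RNN dialog policy (sketch): the recurrent state serves as a learned,
# latent representation of the dialog so far, handling partial
# observability. Feature choices and dimensions are assumptions.
class RNNDialogPolicy(nn.Module):
    def __init__(self, feat_dim, n_actions, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, turn_feats):
        # turn_feats: [B, T, feat_dim], one feature vector per dialog turn
        out, _ = self.rnn(turn_feats)
        return torch.log_softmax(self.head(out), dim=-1)  # log pi over actions
```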