This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later.
Drawing inspiration from gradient-based meta-learning methods with infinitely small gradient steps, we introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
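As a rough illustration of the gradient-flow view (not the paper's actual solver), adapted parameters can be obtained by integrating d(theta)/dt = -grad L(theta) on the support set; the sketch below uses plain forward-Euler steps, and `loss_fn`, `theta0`, `support_data`, and the horizon `T` are illustrative placeholders rather than COMLN's API.

```python
# Hypothetical sketch: gradient-flow adaptation d(theta)/dt = -grad L(theta),
# integrated with forward Euler steps; all names are placeholders.
import jax
import jax.numpy as jnp

def adapt_continuous_time(loss_fn, theta0, support_data, T=1.0, num_steps=100):
    """Follow the gradient vector field of the support-set loss for time T."""
    dt = T / num_steps
    grad_fn = jax.grad(loss_fn)  # gradient with respect to theta
    theta = theta0
    for _ in range(num_steps):
        theta = theta - dt * grad_fn(theta, support_data)  # Euler step along -grad L
    return theta
```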
We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments.
The standard formulation of Reinforcement Learning lacks a practical way of specifying which behaviors are admissible and which are forbidden.
We find that prior approaches either assume that the environment is provided in tabular form -- which is highly restrictive -- or infer "local neighbourhoods" of states over which to run value iteration -- for which we discover an algorithmic bottleneck effect.
We develop a multiple shooting method for learning in deep neural networks based on the Lagrangian perspective on automatic differentiation.
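For context, here is a hedged sketch of the constrained-optimization view this builds on (notation assumed, not the paper's): layer activations are treated as decision variables tied together by equality constraints whose Lagrange multipliers reproduce the backpropagated adjoints, and a multiple-shooting treatment makes the intermediate states free variables that are only required to match the forward dynamics at convergence.

```latex
% Constrained view of a K-layer network (sketch, notation assumed):
\min_{\theta}\ \ell(x_K)
\quad \text{s.t.} \quad x_{k+1} = f_k(x_k, \theta_k), \qquad k = 0, \dots, K-1,
% with Lagrangian
\mathcal{L} = \ell(x_K) + \sum_{k=0}^{K-1} \lambda_{k+1}^{\top}\bigl( f_k(x_k, \theta_k) - x_{k+1} \bigr),
% whose stationarity conditions in x_k give the backpropagation (adjoint) recursion
\lambda_K = \nabla \ell(x_K), \qquad
\lambda_k = \Bigl( \tfrac{\partial f_k}{\partial x_k} \Bigr)^{\top} \lambda_{k+1}.
```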
The shortcomings of maximum likelihood estimation in the context of model-based reinforcement learning have been highlighted by an increasing number of papers.
Value Iteration Networks (VINs) have emerged as a popular method to incorporate planning algorithms within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics.
Previously, such planning components have been incorporated through a neural network that partially aligns with the computational graph of value iteration.
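For reference, a minimal sketch of the tabular value-iteration computation that such planning modules approximate; VINs replace the explicit transition model with learned convolutions, so the explicit P and R below are assumptions made purely for illustration.

```python
# Minimal tabular value iteration, the computation that VIN-style modules mimic
# with learned convolutions; P and R are given explicitly here, unlike in a VIN.
import jax.numpy as jnp

def value_iteration(P, R, gamma=0.95, num_iters=100):
    """P: (A, S, S) transition probabilities, R: (S, A) rewards."""
    V = jnp.zeros(R.shape[0])
    for _ in range(num_iters):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = R + gamma * jnp.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)  # greedy backup over actions
    return V
```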
We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost performance of adaptive optimizers.
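One way to make this concrete, for linear TD(0) (our reading of the setting, not necessarily the paper's exact formulation): the expected semi-gradient update is b - A\theta, where A depends on the bootstrap term through \gamma\phi(s'), and Jacobi preconditioning rescales the update by the inverse diagonal of A.

```latex
% Linear TD(0) (sketch):
A = \mathbb{E}\bigl[\phi(s)\,(\phi(s) - \gamma\,\phi(s'))^{\top}\bigr], \qquad
b = \mathbb{E}\bigl[r\,\phi(s)\bigr],
% expected update direction: b - A\theta; Jacobi-preconditioned step:
\theta \leftarrow \theta + \alpha\,\operatorname{diag}(A)^{-1}\,(b - A\theta).
```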
Temporal abstraction refers to the ability of an agent to use behaviours of controllers that act for a limited, variable amount of time.
In this work, we propose exploration in policy gradient methods based on maximizing entropy of the discounted future state distribution.
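Concretely, the objective referred to here is presumably of the following standard form (discount \gamma, policy \pi; the exact estimator used is in the paper):

```latex
% Discounted future state distribution and its entropy (sketch):
d_{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi), \qquad
\max_{\pi}\; \mathcal{H}(d_{\pi}) = -\sum_{s} d_{\pi}(s) \log d_{\pi}(s).
```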
While often stated as an instance of the likelihood ratio trick [Rubinstein, 1989], the original policy gradient theorem [Sutton, 1999] involves an integral over the action space.
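For completeness, the two forms being contrasted are the following standard statements (notation assumed):

```latex
% Policy gradient theorem as an integral over the action space:
\nabla_{\theta} J(\theta) = \int_{\mathcal{S}} d^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta}\, \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s,
% versus the likelihood-ratio form obtained from
% \nabla_{\theta} \pi_{\theta}(a \mid s) = \pi_{\theta}(a \mid s)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s):
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_{\theta}}\bigl[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \bigr].
```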
Surprisingly, we find that in finite-horizon MDPs there is no strict variance reduction for per-decision importance sampling or stationary importance sampling compared with vanilla importance sampling.
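For reference, the estimators being compared are the standard ones (behaviour policy \mu, target policy \pi, horizon T):

```latex
% Trajectory-wise ("vanilla") importance sampling:
\hat{G}_{\mathrm{IS}} = \Biggl( \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)} \Biggr) \sum_{t=0}^{T-1} \gamma^{t} r_t,
% per-decision importance sampling, which weights each reward only by the ratios up to time t:
\hat{G}_{\mathrm{PDIS}} = \sum_{t=0}^{T-1} \gamma^{t} \Biggl( \prod_{k=0}^{t} \frac{\pi(a_k \mid s_k)}{\mu(a_k \mid s_k)} \Biggr) r_t.
```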
We want to make progress toward artificial general intelligence, namely general-purpose agents that autonomously learn how to competently act in complex environments.
We present a Robust Options Policy Iteration (ROPI) algorithm with convergence guarantees, which learns options that are robust to model uncertainty.
We present new results on learning temporally extended actions for continuous tasks, using the options framework (Sutton et al. [1999b], Precup).
Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations.
Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance.
Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from the experience generated by a different behavior policy.
We show that the Bellman operator underlying the options framework leads to a matrix splitting, an approach traditionally used to speed up convergence of iterative solvers for large linear systems of equations.
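To spell out the connection (this is the standard construction; the options-specific splitting is in the paper): for a fixed policy, policy evaluation solves a linear system, and a matrix splitting of that system yields an iterative solver that reduces to the ordinary Bellman backup in the simplest case.

```latex
% Policy evaluation as a linear system and a generic matrix splitting (sketch):
(I - \gamma P_{\pi})\, V = r_{\pi}, \qquad I - \gamma P_{\pi} = M - N,
% giving the iteration
V_{k+1} = M^{-1}\bigl( N V_k + r_{\pi} \bigr),
% which recovers the standard backup V_{k+1} = r_{\pi} + \gamma P_{\pi} V_k when M = I and N = \gamma P_{\pi}.
```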