This paper contributes a new approach to distributional reinforcement learning that makes explicit a clean separation between transition structure and reward in the learning process.
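As a point of reference for what such a separation looks like, the classical expected-value case already factors into a transition-structure term and a reward term: V^pi = (I - gamma * P)^{-1} r. The sketch below illustrates only this standard tabular analogue, not the paper's distributional construction; the chain MDP, P, r, and gamma are all hypothetical.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy: P is the state-to-state
# transition matrix induced by that policy, r the per-state expected reward.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.95

# The term (I - gamma * P)^-1 depends only on transition structure; the value
# function is its product with the reward vector, so reward can be swapped out
# without relearning the transition-structure term.
transition_term = np.linalg.inv(np.eye(3) - gamma * P)
v = transition_term @ r
```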
Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing.
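A minimal sketch of that convergence test, assuming a tabular agent whose state is summarized by a hypothetical |S| x |A| action-value table inspected before and after an update:

```python
import numpy as np

def has_converged(q_prev, q_next, tol=1e-6):
    """Converged in the sense above: the greedy behaviour and the value
    estimate stop changing at every state (q_* are |S| x |A| tables)."""
    same_behaviour = np.array_equal(q_prev.argmax(axis=1),
                                    q_next.argmax(axis=1))
    same_performance = np.max(np.abs(q_prev - q_next)) < tol
    return same_behaviour and same_performance
```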
Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents.
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL.
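One standard way to realise such an interpolation is h-step lookahead with a learned model and value function: h = 1 recovers the usual greedy step, and letting h grow toward the horizon approaches full planning. The sketch below shows this generic multi-step greedy scheme, not necessarily the paper's specific operator; `model(s, a) -> (reward, next_state)` is an assumed deterministic model.

```python
def lookahead_value(state, h, model, v, actions, gamma=0.99):
    """h-step lookahead backup with a value function v at the leaves."""
    if h == 0:
        return v(state)
    return max(model(state, a)[0]
               + gamma * lookahead_value(model(state, a)[1], h - 1,
                                          model, v, actions, gamma)
               for a in actions)

def improved_action(state, h, model, v, actions, gamma=0.99):
    # The multi-step greedy policy acts by maximising the h-step backup.
    return max(actions,
               key=lambda a: model(state, a)[0]
               + gamma * lookahead_value(model(state, a)[1], h - 1,
                                          model, v, actions, gamma))
```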
We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning.
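One simple way to quantify churn, assuming action values are probed on a fixed set of states before and after a single update (the paper's exact measurement protocol may differ):

```python
import numpy as np

def policy_churn(q_before, q_after):
    """Fraction of probe states whose greedy action changed across one update.
    q_* are |S| x |A| arrays of action values on a hypothetical probe set."""
    return np.mean(q_before.argmax(axis=1) != q_after.argmax(axis=1))
```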
Unlike prior work, which estimates uncertainty by training an ensemble of models and/or value functions, this approach requires only the single model and value function that are already being learned in most model-based reinforcement learning algorithms.
Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options.
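A minimal sketch of the SFs & GPI evaluation step this builds on: with successor features of known option policies, action values for any cumulant expressed in the same feature basis are linear in its weight vector, and generalised policy improvement acts greedily over the best of them. The shapes and names below are assumptions for illustration; the paper's option machinery (e.g., terminations) is not shown.

```python
import numpy as np

def gpi_action(psi, w_new):
    """psi: (n_policies, n_actions, d) successor features of known option
    policies at the current state; w_new: d-dimensional weights of the new
    cumulant in the same feature basis."""
    q = psi @ w_new                      # (n_policies, n_actions) values
    return int(q.max(axis=0).argmax())   # greedy w.r.t. the best known policy
```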
The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning.
Sample efficiency and risk-awareness are central to the development of practical reinforcement learning (RL) for complex decision-making.
As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates.
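A small check of that definition for tabular models, assuming each model is a pair (r, P) with r of shape |S| x |A| and P of shape |S| x |A| x |S|, and that policies and functions are given as |S| x |A| and |S| arrays respectively:

```python
import numpy as np

def bellman_update(r, P, pi, v, gamma=0.9):
    """(T_pi v)(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]."""
    return np.einsum('sa,sa->s', pi, r + gamma * np.einsum('sat,t->sa', P, v))

def value_equivalent(model_a, model_b, policies, functions, gamma=0.9, tol=1e-8):
    """Two models are value equivalent w.r.t. the given policies and functions
    if their Bellman updates agree on every (pi, v) pair."""
    return all(np.allclose(bellman_update(*model_a, pi, v, gamma),
                           bellman_update(*model_b, pi, v, gamma), atol=tol)
               for pi in policies for v in functions)
```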
Both strategies considerably reduce the amount of data needed to solve a reinforcement learning problem.
The question of how to determine which states and actions are responsible for a certain outcome is known as the credit assignment problem and remains a central research question in reinforcement learning and artificial intelligence.
To test our hypothesis empirically, we augmented a standard deep RL agent with an auxiliary task of learning the value-improvement path.
In this paper we extend the SFs & GPI framework in two ways.
We focus on one aspect in particular, namely the ability to generalise to unseen tasks.
We propose Ephemeral Value Adjustments (EVA): a means of allowing deep reinforcement learning agents to rapidly adapt to experience in their replay buffer.
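One plausible reading of such an adjustment, offered only as an assumption consistent with the sentence above and not necessarily the authors' exact scheme, is to mix the network's parametric action values with non-parametric estimates computed from nearby trajectories in the replay buffer and act greedily with respect to the mixture:

```python
def eva_q_values(q_parametric, q_nonparametric, lam=0.4):
    """Hypothetical convex mixture: q_parametric from the learned network,
    q_nonparametric from value estimates over nearby replay trajectories
    (the computation of the non-parametric estimate is not shown)."""
    return [lam * qp + (1.0 - lam) * qn
            for qp, qn in zip(q_parametric, q_nonparametric)]
```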
Some real-world domains are best characterized as a single task, but for others this perspective is limiting.
Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks.