no code implementations • 16 Jun 2023 • Charline Le Lan, Stephen Tu, Mark Rowland, Anna Harutyunyan, Rishabh Agarwal, Marc G. Bellemare, Will Dabney
In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988).
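For context, the tabular TD(0) update at the heart of that analysis is sketched below; the dictionary-style value table, step size, and discount are illustrative assumptions, not details from the paper.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step: nudge V(s) toward the bootstrapped target."""
    td_error = r + gamma * V[s_next] - V[s]  # reward plus discounted next-state value
    V[s] += alpha * td_error
    return V
```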
no code implementations • 29 May 2023 • Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko
Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings.
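A minimal sketch of the n-step lookahead target behind such methods; the interface (a list of n rewards plus a bootstrap state) is an assumption for illustration.

```python
def n_step_return(rewards, V, s_n, gamma=0.99):
    """n-step target: n discounted rewards plus a bootstrapped value
    at the state reached after the lookahead."""
    G = sum(gamma**t * r for t, r in enumerate(rewards))
    return G + gamma**len(rewards) * V[s_n]
```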
no code implementations • 11 Jan 2023 • Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning.
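A tabular sketch of the QTD update under the common midpoint quantile parameterisation; the per-state quantile arrays and step size are illustrative, not the paper's exact setup.

```python
import numpy as np

def qtd_update(theta, theta_next, r, alpha=0.05, gamma=0.99):
    """One tabular QTD step for a single state transition.

    theta:      (m,) quantile estimates of the return at the current state
    theta_next: (m,) quantile estimates at the successor state
    """
    m = len(theta)
    tau = (2 * np.arange(m) + 1) / (2 * m)  # midpoint quantile levels
    targets = r + gamma * theta_next        # sample target quantiles
    for i in range(m):
        # quantile-regression step: tau_i minus the fraction of targets below theta_i
        theta[i] += alpha * np.mean(tau[i] - (targets < theta[i]))
    return theta
```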
no code implementations • NeurIPS 2021 • David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael L. Littman, Doina Precup, Satinder Singh
We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists.
no code implementations • 1 Jan 2021 • Thomas Mesnard, Theophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Marcus Hutter, Lars Holger Buesing, Remi Munos
Credit assignment in reinforcement learning is the problem of measuring an action’s influence on future rewards.
no code implementations • 18 Nov 2020 • Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, Éric Moulines, Marcus Hutter, Lars Buesing, Rémi Munos
Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards.
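To make the problem concrete: the naive solution credits each action with the entire discounted return that follows it, which is unbiased but high-variance. A minimal sketch of that baseline (not the paper's method):

```python
def discounted_returns(rewards, gamma=0.99):
    """Naive credit assignment: the action at step t is credited with the
    full discounted return from t onward (high variance, no counterfactuals)."""
    G, credits = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        credits.append(G)
    return credits[::-1]
```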
no code implementations • 2 Nov 2020 • Paniz Behboudian, Yash Satsangi, Matthew E. Taylor, Anna Harutyunyan, Michael Bowling
Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered.
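The guarantee referenced here is the classic potential-based shaping result (Ng et al., 1999); a minimal sketch, where `potential` stands for any user-supplied heuristic Φ.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Add F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward;
    shaping of this form provably leaves optimal policies unchanged."""
    return r + gamma * potential(s_next) - potential(s)
```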
1 code implementation • NeurIPS 2019 • Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, Remi Munos
We consider the problem of efficient credit assignment in reinforcement learning.
no code implementations • 16 Oct 2019 • Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney
We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
no code implementations • 26 Feb 2019 • Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, Doina Precup
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents.
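For readers unfamiliar with the term, an option in the standard framework (Sutton, Precup & Singh, 1999) is a triple of initiation set, intra-option policy, and termination condition; the sketch below fixes illustrative types.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A temporally extended behaviour: where it may start, how it acts,
    and when it terminates."""
    can_initiate: Callable[[int], bool]       # initiation set I
    policy: Callable[[int], int]              # intra-option policy pi
    termination_prob: Callable[[int], float]  # termination condition beta(s)
```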
no code implementations • 10 Nov 2017 • Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowe
Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient.
no code implementations • 22 Aug 2017 • Denis Steckelmacher, Diederik M. Roijers, Anna Harutyunyan, Peter Vrancx, Hélène Plisnier, Ann Nowé
Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability.
3 code implementations • NeurIPS 2016 • Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc G. Bellemare
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning.
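This is the paper that introduced Retrace(λ). A sketch of its per-trajectory target under the usual indexing conventions, with the truncated trace c_t = λ min(1, ρ_t) as the key ingredient; the list-based interface is an assumption for illustration.

```python
def retrace_target(q0, deltas, rhos, lam=1.0, gamma=0.99):
    """Retrace(lambda) target for Q(x_0, a_0) along one trajectory.

    deltas[t]: TD error r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)
    rhos[t]:   importance ratio pi(a_t | x_t) / mu(a_t | x_t)
    """
    target, trace = q0, 1.0
    for t in range(len(deltas)):
        if t > 0:
            trace *= lam * min(1.0, rhos[t])  # truncated trace: safe off-policy
        target += gamma**t * trace * deltas[t]
    return target
```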
no code implementations • 16 Feb 2016 • Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, Remi Munos
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities.
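A sketch of that reward-space correction: each TD error bootstraps with the expected Q-value under the target policy rather than reweighting transitions by importance ratios. Argument names and the list-based interface are illustrative.

```python
def q_pi_lambda_target(q0, rewards, exp_q_next, q_taken, lam=0.9, gamma=0.99):
    """Off-policy multi-step target corrected in reward space (no ratios).

    exp_q_next[t]: E_pi[Q(x_{t+1}, .)] under the target policy
    q_taken[t]:    Q(x_t, a_t) for the behaviour action actually taken
    """
    target = q0
    for t, r in enumerate(rewards):
        delta = r + gamma * exp_q_next[t] - q_taken[t]
        target += (gamma * lam)**t * delta
    return target
```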
no code implementations • 11 Feb 2015 • Anna Harutyunyan, Tim Brys, Peter Vrancx, Ann Nowe
While PBRS is proven to always preserve optimal policies, its effect on learning speed is determined by the quality of its potential function, which, in turn, depends on both the underlying heuristic and the scale.
no code implementations • 21 May 2014 • Anna Harutyunyan, Tim Brys, Peter Vrancx, Ann Nowe
Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions in parallel, off-policy, without sacrificing convergence guarantees or computational efficiency.
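One gradient-TD method in that family is TDC; a minimal sketch with linear features, where running several instances over a shared behaviour stream trains several value functions in parallel. Step sizes, feature shapes, and the rho-weighted off-policy form are illustrative assumptions.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, r, rho=1.0,
               alpha=0.01, beta=0.1, gamma=0.99):
    """One TDC (gradient-TD) step with linear features.

    rho is the importance ratio pi(a|s) / mu(a|s); gradient-TD methods
    remain convergent off-policy, so many value functions can share
    one behaviour stream.
    """
    delta = r + gamma * theta @ phi_next - theta @ phi  # linear TD error
    theta += alpha * rho * (delta * phi - gamma * phi_next * (w @ phi))
    w += beta * rho * (delta - w @ phi) * phi           # auxiliary weights
    return theta, w
```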