no code implementations • 5 Sep 2024 • Huizhen Yu, Yi Wan, Richard S. Sutton

This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion.

no code implementations • 29 Aug 2024 • Yi Wan, Huizhen Yu, Richard S. Sutton

Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.

no code implementations • 21 Jun 2024 • Kris de Asis, Richard S. Sutton

Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps.

1 code implementation • 16 May 2024 • Abhishek Naik, Yi Wan, Manan Tomar, Richard S. Sutton

We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average.
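The reward-centering idea admits a small tabular sketch, assuming a discounted TD(0) setting; the function name and step sizes here are illustrative, not the authors' implementation:

```python
# Sketch of reward centering: the discounted TD update uses the reward
# minus a running empirical average of all rewards observed so far.

def make_centered_td(alpha=0.1, gamma=0.99):
    state = {"v": {}, "r_bar": 0.0, "n": 0}

    def update(s, r, s_next):
        state["n"] += 1
        # incremental empirical average of observed rewards
        state["r_bar"] += (r - state["r_bar"]) / state["n"]
        v = state["v"]
        delta = (r - state["r_bar"]) + gamma * v.get(s_next, 0.0) - v.get(s, 0.0)
        v[s] = v.get(s, 0.0) + alpha * delta
        return delta

    return state, update
```

Subtracting the empirical average leaves the greedy policy unchanged while removing the large constant offset that otherwise dominates discounted values in continuing problems.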

no code implementations • 22 Dec 2023 • Huizhen Yu, Yi Wan, Richard S. Sutton

In this paper, we study asynchronous stochastic approximation algorithms without communication delays.

no code implementations • 2 Oct 2023 • Kenny Young, Richard S. Sutton

Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains.

no code implementations • 27 Jun 2023 • Kristopher De Asis, Eric Graves, Richard S. Sutton

Importance sampling is a central idea underlying off-policy prediction in reinforcement learning.
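A minimal illustration of how importance sampling enters off-policy prediction, assuming tabular TD(0) with known target policy `pi` and behavior policy `b` (names are hypothetical):

```python
# Off-policy TD(0) with per-step importance sampling: each update is
# weighted by rho = pi(a|s) / b(a|s), correcting for the mismatch
# between the behavior policy and the target policy.

def is_td0_update(v, s, a, r, s_next, pi, b, alpha=0.1, gamma=0.9):
    """One importance-sampling-corrected TD(0) update on dict v (in place)."""
    rho = pi[s][a] / b[s][a]                      # importance sampling ratio
    delta = r + gamma * v.get(s_next, 0.0) - v.get(s, 0.0)
    v[s] = v.get(s, 0.0) + alpha * rho * delta
    return rho
```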

1 code implementation • 23 Jun 2023 • Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, A. Rupam Mahmood, Richard S. Sutton

If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples.

no code implementations • 30 Sep 2022 • Yi Wan, Richard S. Sutton

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas & Borkar 2001), converge in weakly communicating MDPs.

no code implementations • 23 Aug 2022 • Richard S. Sutton, Michael Bowling, Patrick M. Pilarski

Herein we describe our approach to artificial intelligence research, which we call the Alberta Plan.

no code implementations • 4 Jul 2022 • Tian Tian, Kenny Young, Richard S. Sutton

However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action spaces.

no code implementations • 25 May 2022 • Yi Wan, Richard S. Sutton

In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with fewer elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value matching that achieved by two human-designed options, 3) the amount of computation used by option-value iteration with the options discovered by our algorithm matches that used with the human-designed options, and 4) the options produced by our algorithm also make intuitive sense: they seem to move to and terminate at the entrances of rooms.

no code implementations • 26 Feb 2022 • Richard S. Sutton

It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.

no code implementations • 20 Feb 2022 • Richard S. Sutton

The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.

no code implementations • 7 Feb 2022 • Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White

Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process.


no code implementations • 30 Dec 2021 • Amir Samani, Richard S. Sutton

Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data.

no code implementations • NeurIPS 2021 • Yi Wan, Abhishek Naik, Richard S. Sutton

We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs.

no code implementations • 10 Sep 2021 • Sina Ghiassian, Richard S. Sutton

In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two.

1 code implementation • 13 Aug 2021 • Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood

The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former.

2 code implementations • 2 Jun 2021 • Sina Ghiassian, Richard S. Sutton

In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter.

no code implementations • 17 Apr 2021 • Katya Kudashkina, Yi Wan, Abhishek Naik, Richard S. Sutton

Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.

1 code implementation • 15 Feb 2021 • Dylan R. Ashley, Sina Ghiassian, Richard S. Sutton

Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs); however, it continues to be a poorly understood phenomenon.

1 code implementation • 8 Jan 2021 • Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.

no code implementations • 1 Jan 2021 • Kristopher De Asis, Alan Chan, Yi Wan, Richard S. Sutton

Our emphasis is on the first approach in this work, detailing an incremental policy gradient update which neither waits until the end of the episode, nor relies on learning estimates of the return.

no code implementations • 28 Oct 2020 • Kenny Young, Richard S. Sutton

We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.

no code implementations • 27 Aug 2020 • Katya Kudashkina, Patrick M. Pilarski, Richard S. Sutton

In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning.


no code implementations • 26 Aug 2020 • Alan Chan, Kris de Asis, Richard S. Sutton

In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function.

2 code implementations • 29 Jun 2020 • Yi Wan, Abhishek Naik, Richard S. Sutton

We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset.
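A tabular sketch of a Differential Q-learning-style update, following the general form described for average-reward control: the TD error uses the reward minus a learned reward-rate estimate, and that estimate is itself driven by the TD error. Function name and step sizes are illustrative, not the paper's code:

```python
# Average-reward (differential) Q-learning sketch: no discounting and no
# reference state; a reward-rate estimate r_bar is learned alongside q.

def differential_q_update(q, r_bar, s, a, r, s_next, actions,
                          alpha=0.1, eta=1.0):
    """One tabular update; q maps (state, action) pairs to estimates."""
    greedy_next = max(q.get((s_next, b), 0.0) for b in actions)
    delta = r - r_bar + greedy_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * delta
    r_bar = r_bar + eta * alpha * delta   # reward-rate estimate tracks delta
    return r_bar, delta
```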

no code implementations • 9 Dec 2019 • J. Fernando Hernandez-Garcia, Richard S. Sutton

Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward.

no code implementations • 4 Oct 2019 • Abhishek Naik, Roshan Shariff, Niko Yasui, Hengshuai Yao, Richard S. Sutton

Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks.

no code implementations • 9 Sep 2019 • Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps.
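The fixed-horizon idea can be sketched as maintaining one value table per horizon, where the h-horizon estimate bootstraps from the (h-1)-horizon estimate at the next state; this is an illustrative tabular sketch, with the zero-horizon value identically zero:

```python
# Fixed-horizon TD sketch: v[(h, s)] predicts the (undiscounted) sum of
# the next h rewards from state s.

def fixed_horizon_td_update(v, s, r, s_next, horizon, alpha=0.5):
    for h in range(1, horizon + 1):
        target = r + v.get((h - 1, s_next), 0.0)   # V_0 is identically 0
        v[(h, s)] = v.get((h, s), 0.0) + alpha * (target - v.get((h, s), 0.0))
```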

no code implementations • 2 Apr 2019 • Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton

In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.

no code implementations • 8 Mar 2019 • Alex Kearney, Vivek Veeriah, Jaden Travnik, Patrick M. Pilarski, Richard S. Sutton

In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting the step-size parameters of stochastic gradient descent, building on a variety of prior work in stochastic approximation, machine learning, and artificial neural networks.

1 code implementation • 1 Mar 2019 • Xiang Gu, Sina Ghiassian, Richard S. Sutton

ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training.

1 code implementation • 22 Jan 2019 • J. Fernando Hernandez-Garcia, Richard S. Sutton

Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and $Q(\sigma)$; (2) increasing the backup length $n$ consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and $Q$-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, $Q(\sigma)$, and Retrace in this particular task.

no code implementations • 6 Nov 2018 • Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades.

no code implementations • 20 Sep 2018 • Kristopher De Asis, Brendan Bennett, Richard S. Sutton

Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning.

no code implementations • 5 Jul 2018 • Kristopher De Asis, Richard S. Sutton

Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme.
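The unification described here can be seen in the n-step return itself: with one reward it is the one-step TD target, and with all remaining rewards and a zero bootstrap it is the Monte Carlo return. A minimal sketch (illustrative helper, not from the paper):

```python
# n-step TD target: discounted sum of the first n rewards plus a
# bootstrapped value at the state reached after n steps.

def n_step_return(rewards, bootstrap_value, gamma=0.9):
    g = bootstrap_value
    for r in reversed(rewards):   # fold back-to-front: g = r + gamma * g
        g = r + gamma * g
    return g
```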

no code implementations • ICLR 2018 • Kenny J. Young, Richard S. Sutton, Shuo Yang

We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful.

no code implementations • 18 May 2018 • Sina Ghiassian, Huizhen Yu, Banafsheh Rafiee, Richard S. Sutton

We apply neural nets with ReLU gates in online reinforcement learning.

no code implementations • 10 Apr 2018 • Alex Kearney, Vivek Veeriah, Jaden B. Travnik, Richard S. Sutton, Patrick M. Pilarski

In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning.

no code implementations • 16 Feb 2018 • Jaden B. Travnik, Kory W. Mathewson, Richard S. Sutton, Patrick M. Pilarski

The relationship between a reinforcement learning (RL) agent and an asynchronous environment is often ignored.

no code implementations • 25 Jan 2018 • Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, Richard S. Sutton

This paper investigates estimating the variance of a temporal-difference learning agent's update target.

4 code implementations • 4 Dec 2017 • Shangtong Zhang, Richard S. Sutton

Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink its utility.

no code implementations • 10 Nov 2017 • Patrick M. Pilarski, Richard S. Sutton, Kory W. Mathewson, Craig Sherstan, Adam S. R. Parker, Ann L. Edwards

This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness.

no code implementations • 11 May 2017 • Sina Ghiassian, Banafsheh Rafiee, Richard S. Sutton

In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem.

no code implementations • 10 May 2017 • Adam White, Richard S. Sutton

This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm.

1 code implementation • 9 May 2017 • Jaeyoung Lee, Richard S. Sutton

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem.

no code implementations • 14 Apr 2017 • Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.

no code implementations • 3 Mar 2017 • Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance.

1 code implementation • 9 Feb 2017 • Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner.

no code implementations • 9 Dec 2016 • Vivek Veeriah, Shangtong Zhang, Richard S. Sutton

In this paper, we introduce a new incremental learning algorithm called crossprop, which learns the incoming weights of hidden units based on the meta-gradient descent approach that was previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.

no code implementations • 9 Jun 2016 • Vivek Veeriah, Patrick M. Pilarski, Richard S. Sutton

The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required to adapt to the user's task preferences by learning to perceive the value of its behavior from the human user, particularly from the user's facial expressions; we call this face valuing.

1 code implementation • 13 Dec 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton

Our results suggest that the true online methods indeed dominate the regular methods.

no code implementations • 19 Aug 2015 • Hado van Hasselt, Richard S. Sutton

If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed.

no code implementations • 25 Jul 2015 • Richard S. Sutton

This document is a guide to the implementation of true online emphatic TD($\lambda$), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014).

no code implementations • 6 Jul 2015 • A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton

Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.
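A simplified sketch of how emphasis enters the update, assuming linear ETD(0) with unit interest in every state (the general algorithm with a bootstrapping parameter and per-state interest is more involved; names here are illustrative):

```python
# Linear emphatic TD(0) sketch: the followon trace F accumulates
# discounted, importance-weighted history and scales ("emphasizes")
# each update, changing the effective state distribution.

def etd0_update(w, f_prev, rho_prev, x, r, x_next, rho, alpha=0.1, gamma=0.9):
    dot = lambda u, z: sum(ui * zi for ui, zi in zip(u, z))
    f = 1.0 + gamma * rho_prev * f_prev          # followon trace, interest = 1
    delta = r + gamma * dot(w, x_next) - dot(w, x)   # TD error
    for i in range(len(w)):
        w[i] += alpha * f * rho * delta * x[i]   # emphasized update
    return f
```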

no code implementations • 1 Jul 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Richard S. Sutton

Our results confirm the strength of true online TD($\lambda$): 1) for sparse feature vectors, the computational overhead with respect to TD($\lambda$) is minimal, and for non-sparse features the computation time is at most twice that of TD($\lambda$); 2) across all domains/representations, the learning speed of true online TD($\lambda$) is often better than, and never worse than, that of TD($\lambda$); and 3) true online TD($\lambda$) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size.

no code implementations • NeurIPS 2004 • Richard S. Sutton, Brian Tanner

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions.

no code implementations • 14 Mar 2015 • Richard S. Sutton, A. Rupam Mahmood, Martha White

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps.

no code implementations • NeurIPS 2014 • Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar

We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.

no code implementations • NeurIPS 2014 • A. Rupam Mahmood, Hado P. Van Hasselt, Richard S. Sutton

Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD($\lambda$).

no code implementations • 18 Sep 2013 • Ann L. Edwards, Alexandra Kearney, Michael Rory Dawson, Richard S. Sutton, Patrick M. Pilarski

In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm.

no code implementations • 13 Jun 2012 • Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling

Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.

1 code implementation • 22 May 2012 • Thomas Degris, Martha White, Richard S. Sutton

Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning.

no code implementations • 6 Dec 2011 • Joseph Modayil, Adam White, Richard S. Sutton

The term "nexting" has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense.

no code implementations • NeurIPS 2009 • Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks.

no code implementations • NeurIPS 2009 • Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári

We extend the Dyna planning architecture for policy evaluation and control in two significant aspects.

no code implementations • NeurIPS 2008 • Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters.

no code implementations • NeurIPS 2008 • Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses.

1 code implementation • Artificial Intelligence 1999 • Richard S. Sutton, Doina Precup, Satinder Singh

In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
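The interchangeability of options and primitive actions can be sketched with an SMDP Q-learning-style update: an option that ran for k steps and accumulated discounted reward is updated like a primitive action, but with the discount raised to the k-th power at the bootstrap (illustrative names, not the paper's code):

```python
# SMDP Q-learning sketch: q maps (state, option) pairs to estimates;
# primitive actions are the special case k = 1.

def smdp_q_update(q, s, option, r_total, k, s_next, options,
                  alpha=0.1, gamma=0.9):
    target = r_total + (gamma ** k) * max(q.get((s_next, o), 0.0)
                                          for o in options)
    q[(s, option)] = q.get((s, option), 0.0) + alpha * (
        target - q.get((s, option), 0.0))
```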

1 code implementation • Machine Learning 1988 • Richard S. Sutton

This article introduces a class of incremental learning procedures specialized for prediction, that is, for using past experience with an incompletely known system to predict its future behavior.
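The best-known member of this class is TD($\lambda$); a tabular sketch with accumulating eligibility traces (illustrative function name and parameters) shows the incremental character of these procedures:

```python
# TD(lambda) with accumulating eligibility traces: every past state with
# a nonzero trace is nudged toward each new TD error, with older states
# weighted less.

def td_lambda_episode(transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """transitions: list of (s, r, s_next); returns the learned value dict."""
    v, e = {}, {}
    for s, r, s_next in transitions:
        delta = r + gamma * v.get(s_next, 0.0) - v.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0               # accumulating trace
        for st in list(e):
            v[st] = v.get(st, 0.0) + alpha * delta * e[st]
            e[st] *= gamma * lam                 # decay all traces
    return v
```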
