no code implementations • 5 Sep 2024 • Huizhen Yu, Yi Wan, Richard S. Sutton
This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion.
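As a rough sketch of the kind of algorithm this analysis covers, the following shows an RVI-style asynchronous Q-learning update for an average-reward SMDP. The tabular setting, the reference-pair offset f(Q) = Q[ref], and all names here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rvi_q_smdp_update(Q, s, a, r, tau, s_next, alpha, ref=(0, 0)):
    """One asynchronous RVI-style Q-learning update for an average-reward SMDP.

    Q    : (num_states, num_actions) table; only the visited entry is updated.
    r    : reward accumulated over the transition.
    tau  : holding (sojourn) time of the transition in the SMDP.
    ref  : reference state-action pair; f(Q) = Q[ref] serves as the RVI offset
           estimating the reward rate (an illustrative choice).
    """
    rate_estimate = Q[ref]  # RVI offset in place of the unknown optimal reward rate
    target = r - rate_estimate * tau + np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])  # asynchronous: one component per step
    return Q
```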
no code implementations • 29 Aug 2024 • Yi Wan, Huizhen Yu, Richard S. Sutton
Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
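To make the options setting concrete, here is a minimal sketch of an SMDP-level, RVI-style update over options. The option interface (.policy, .beta), the environment API, and the reference-pair offset are all assumptions made for this sketch, not the paper's algorithms.

```python
import numpy as np

def run_option(env, state, option, rng):
    """Execute an option to termination; return (cumulative reward, duration,
    next state). The option and env interfaces are illustrative assumptions."""
    total_reward, tau = 0.0, 0
    while True:
        action = option.policy(state)
        state, reward, _ = env.step(action)
        total_reward += reward
        tau += 1
        if rng.random() < option.beta(state):  # stochastic termination
            return total_reward, tau, state

def hierarchical_rvi_update(Q, s, o, env, options, alpha, rng, ref=(0, 0)):
    """SMDP-level RVI-style update over options: the option plays the role of
    an action, and its duration tau enters the target as in the flat case."""
    R, tau, s_next = run_option(env, s, options[o], rng)
    target = R - Q[ref] * tau + np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return s_next
```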
no code implementations • 22 Dec 2023 • Huizhen Yu, Yi Wan, Richard S. Sutton
In this paper, we study asynchronous stochastic approximation algorithms without communication delays.
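For orientation, a generic asynchronous SA iteration looks roughly like the sketch below, where only a subset of components is updated at each step. The Gaussian noise and the component schedule are simplified placeholders for the conditions treated in the paper.

```python
import numpy as np

def async_sa_step(x, h, alphas, active, rng, noise_scale=0.1):
    """One asynchronous stochastic approximation step.

    Only the components listed in `active` are updated, each with its own
    stepsize, from a noisy observation of the mean field h. The Gaussian term
    is a placeholder for the martingale-difference noise in the analysis.
    """
    obs = h(x) + rng.normal(scale=noise_scale, size=x.shape)  # noisy h(x_n)
    for i in active:
        x[i] += alphas[i] * obs[i]  # x_{n+1}(i) = x_n(i) + alpha_n(i) * obs_n(i)
    return x
```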
no code implementations • 18 May 2018 • Sina Ghiassian, Huizhen Yu, Banafsheh Rafiee, Richard S. Sutton
We apply neural networks with ReLU gates to online reinforcement learning.
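A minimal example of the function class involved, a two-layer network with ReLU gates, purely as a reference point and not the paper's architecture:

```python
import numpy as np

def relu_value(x, W1, b1, w2, b2):
    """A two-layer network with ReLU gates estimating a scalar value for
    input features x; shapes: W1 (h, d), b1 (h,), w2 (h,), b2 scalar."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # ReLU gate: max(0, pre-activation)
    return w2 @ hidden + b2
```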
no code implementations • 27 Dec 2017 • Huizhen Yu
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation.
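As one representative member of this family, here is a sketch of an off-policy TDC update with linear function approximation, written in a common textbook form; it indicates the shape of gradient-based TD algorithms rather than the paper's specific variants.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, r, gamma, rho, alpha, beta):
    """One off-policy TDC update with linear function approximation.

    theta : main weight vector (value estimate is theta @ phi)
    w     : secondary iterate used to correct the gradient direction
    rho   : importance sampling ratio pi(a|s) / mu(a|s)
    """
    delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (rho * delta - w @ phi) * phi        # tracks a projected TD error
    return theta, w
```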
no code implementations • 14 Apr 2017 • Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton
To establish its soundness, we use Markov chain theory to prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that our scheme is associated with a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.
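One illustrative way to let $\lambda$ evolve with the trace, shrinking it just enough to keep the trace in a bounded set, is sketched below; this is in the spirit of the paper's scheme but simplified, and the bounding rule is an assumption of this sketch.

```python
import numpy as np

def trace_step(e, phi, gamma, rho_prev, trace_bound):
    """Eligibility-trace update with a history-dependent lambda.

    Lambda is shrunk just enough to keep the trace inside a prescribed bound,
    one simple way to make it depend on the evolving state-trace process.
    """
    carried = gamma * rho_prev * e  # trace carried over at lambda = 1
    norm = np.linalg.norm(carried)
    lam = min(1.0, trace_bound / norm) if norm > 0 else 1.0
    e = lam * carried + phi         # e_t = lambda_t * gamma * rho_{t-1} * e_{t-1} + phi(S_t)
    return e, lam
```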
1 code implementation • 9 Feb 2017 • Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton
We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner.
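The construction can be sketched as follows: choosing the per-pair bootstrapping parameter so that the product $\lambda(s,a)\rho(s,a)$ stays bounded means the ratio $\rho = \pi/\mu$ never appears explicitly. The form below is a simplified reading of the ABQ($\zeta$) idea, not a verbatim reproduction of the paper.

```python
def abq_trace_factor(pi_prob, mu_prob, zeta):
    """Action-dependent bootstrapping factor in the spirit of ABQ(zeta).

    With lambda(s, a) = nu * mu(a|s) and nu = zeta / max(mu(a|s), pi(a|s)),
    the trace factor lambda * rho equals nu * pi(a|s) <= zeta, so the ratio
    rho = pi / mu never has to be computed and the factor stays bounded.
    """
    nu = zeta / max(mu_prob, pi_prob)
    return nu * pi_prob  # = lambda(s, a) * rho(s, a), bounded by zeta
```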
no code implementations • 6 May 2016 • Huizhen Yu
This is a companion note to our recent study of the weak convergence properties of constrained emphatic temporal-difference learning (ETD) algorithms from a theoretical perspective.
no code implementations • 23 Nov 2015 • Huizhen Yu
In this paper we present convergence results for constrained versions of ETD($\lambda$) with constant stepsize and with diminishing stepsizes from a broad range.
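Concretely, a constrained iteration projects each update back onto a convex set. The sketch below uses a Euclidean ball and two illustrative stepsize schedules, standing in for the paper's more general constraint sets and stepsize conditions.

```python
import numpy as np

def project_ball(theta, radius):
    """Euclidean projection onto {x : ||x|| <= radius}, one simple choice of
    constraint set; the paper allows more general convex sets."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def constrained_update(theta, increment, alpha, radius):
    """theta_{n+1} = Pi(theta_n + alpha_n * increment_n), with `increment`
    standing in for the ETD(lambda) update direction delta_n * e_n."""
    return project_ball(theta + alpha * increment, radius)

# Two stepsize regimes: constant, and diminishing such as alpha_n = c / n**q.
constant_alpha = lambda n, c=0.01: c
diminishing_alpha = lambda n, c=1.0, q=0.6: c / (n + 1) ** q  # q in (1/2, 1] is one classical range
```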
no code implementations • 6 Jul 2015 • A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.
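The mechanism can be sketched as follows: a follow-on trace accumulates interest, an emphasis term reweights each update, and the eligibility trace carries that weighting, with the parameter update being $\theta \leftarrow \theta + \alpha\,\delta_t e_t$. The code is a generic sketch of the standard ETD($\lambda$) recursions, not the paper's exact formulation.

```python
import numpy as np

def etd_traces(F_prev, e_prev, rho_prev, rho, gamma, lam, interest, phi):
    """Follow-on trace, emphasis, and eligibility trace of ETD(lambda).

    The emphasis M_t reweights the update at time t, which is how these
    algorithms reshape their effective state distribution.
    """
    F = rho_prev * gamma * F_prev + interest    # follow-on trace F_t
    M = lam * interest + (1.0 - lam) * F        # emphasis M_t
    e = rho * (gamma * lam * e_prev + M * phi)  # emphatic eligibility trace e_t
    return F, M, e
```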
no code implementations • 8 Jun 2015 • Huizhen Yu
We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces.