Search Results for author: Huizhen Yu

Found 11 papers, 1 paper with code

Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

no code implementations · 5 Sep 2024 · Huizhen Yu, Yi Wan, Richard S. Sutton

This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion.

Q-Learning · Reinforcement Learning +1
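
For context, asynchronous stochastic approximation updates only one (or a few) coordinates of the iterate at each step using noisy observations. The sketch below is a minimal toy instance of that setting, not the paper's algorithm or conditions; the contraction map, noise model, and stepsizes are all illustrative assumptions.

```python
import numpy as np

# Toy asynchronous stochastic approximation: approach the fixed point of a
# made-up contraction h(x) = A @ x + b by updating one randomly chosen
# coordinate per step from a noisy evaluation. All quantities here are
# illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
n = 5
A = 0.5 * rng.uniform(-1, 1, size=(n, n)) / n    # small entries => max-norm contraction
b = rng.uniform(-1, 1, size=n)
x_star = np.linalg.solve(np.eye(n) - A, b)       # true fixed point, for reference

x = np.zeros(n)
counts = np.zeros(n)                             # per-coordinate update counts
for t in range(200_000):
    i = rng.integers(n)                          # asynchronous: one coordinate at a time
    counts[i] += 1
    alpha = 1.0 / counts[i] ** 0.7               # coordinate-local diminishing stepsize
    noise = 0.1 * rng.standard_normal()
    x[i] += alpha * (A[i] @ x + b[i] - x[i] + noise)

print("max error vs. fixed point:", np.max(np.abs(x - x_star)))
```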

On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

no code implementations · 29 Aug 2024 · Yi Wan, Huizhen Yu, Richard S. Sutton

We extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.

Q-Learning · Reinforcement Learning (RL)
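
RVI-style average-reward Q-learning, the family this analysis concerns, subtracts a reward-rate estimate f(Q) in every update. Below is a hedged tabular sketch on a synthetic MDP; the environment, exploration scheme, stepsize, and the particular choice of f(Q) are illustrative assumptions, not the setting analyzed in the paper.

```python
import numpy as np

# Tabular RVI-style average-reward Q-learning on a small synthetic MDP.
# The MDP, behavior policy, and reference function f(Q) below are
# illustrative assumptions only.
rng = np.random.default_rng(1)
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected one-step rewards

Q = np.zeros((n_states, n_actions))
alpha, eps = 0.02, 0.2
s = 0
for t in range(200_000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a] + 0.1 * rng.standard_normal()                     # noisy reward
    f_Q = Q.max(axis=1).mean()                                    # reference function f(Q)
    Q[s, a] += alpha * (r - f_Q + Q[s_next].max() - Q[s, a])      # RVI-style update
    s = s_next

print("estimated optimal reward rate:", Q.max(axis=1).mean())
```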

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

no code implementations · 27 Dec 2017 · Huizhen Yu

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation.
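Gradient-based TD methods of this kind (e.g., GTD2/TDC) maintain a secondary weight vector that corrects the TD update under off-policy sampling. The sketch below shows a TDC-style update with linear features; the Markov chain, feature map, policies, and stepsizes are all illustrative assumptions, not the paper's conditions.

```python
import numpy as np

# TDC-style gradient TD(0) with linear function approximation for
# off-policy policy evaluation. The chain, features, and the two
# policies are made up solely to show the update equations.
rng = np.random.default_rng(2)
n_states, n_actions, d = 6, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))
phi = rng.standard_normal((n_states, d))          # linear features
gamma = 0.95
pi = np.full((n_states, n_actions), 0.5)          # target policy (uniform, for illustration)
mu = np.array([[0.7, 0.3]] * n_states)            # behavior policy

theta = np.zeros(d)                               # primary weights
w = np.zeros(d)                                   # secondary (correction) weights
alpha, beta = 0.01, 0.05
s = 0
for t in range(100_000):
    a = rng.choice(n_actions, p=mu[s])
    rho = pi[s, a] / mu[s, a]                     # importance sampling ratio
    s_next = rng.choice(n_states, p=P[s, a])
    delta = R[s, a] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * rho * (delta * phi[s] - gamma * (phi[s] @ w) * phi[s_next])
    w += beta * (rho * delta - phi[s] @ w) * phi[s]
    s = s_next

print("value estimates:", phi @ theta)
```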

On Generalized Bellman Equations and Temporal-Difference Learning

no code implementations · 14 Apr 2017 · Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.
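In trace-based schemes of this kind, the evolving bootstrapping parameter enters the eligibility-trace recursion. The sketch below shows plain off-policy TD(λ) with a simple state-dependent λ and linear features; it is only a toy special case, not the history-dependent λ scheme analyzed in the paper, and the chain, features, λ function, and policies are made up.

```python
import numpy as np

# Off-policy TD(lambda) with a state-dependent bootstrapping parameter
# lambda(s) and linear features. Chain, features, policies, and the
# lambda function are illustrative assumptions only.
rng = np.random.default_rng(3)
n_states, n_actions, d = 6, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))
phi = rng.standard_normal((n_states, d))
lam = rng.uniform(0.0, 0.9, size=n_states)        # state-dependent lambda
gamma = 0.95
pi = np.full((n_states, n_actions), 0.5)          # target policy
mu = np.array([[0.6, 0.4]] * n_states)            # behavior policy

theta = np.zeros(d)
e = np.zeros(d)                                   # eligibility trace
alpha = 0.005
s = 0
for t in range(100_000):
    a = rng.choice(n_actions, p=mu[s])
    rho = pi[s, a] / mu[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    delta = R[s, a] + gamma * phi[s_next] @ theta - phi[s] @ theta
    # The old trace decays by gamma * lambda(s); the whole trace is rho-scaled:
    e = rho * (gamma * lam[s] * e + phi[s])
    theta += alpha * delta * e
    s = s_next

print("value estimates:", phi @ theta)
```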

Multi-step Off-policy Learning Without Importance Sampling Ratios

1 code implementation · 9 Feb 2017 · Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner.
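A well-known special case of such action-dependent bootstrapping is Tree-Backup(λ), which bootstraps with the target policy's probabilities and never forms importance sampling ratios. The tabular sketch below shows that special case, not the algorithm proposed in the paper; the MDP, policies, and stepsize are illustrative assumptions.

```python
import numpy as np

# Tabular Tree-Backup(lambda)-style updates: off-policy action-value
# learning whose trace decay and bootstrap use the target policy's
# probabilities rather than importance sampling ratios. The MDP and
# policies are illustrative assumptions.
rng = np.random.default_rng(4)
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))
gamma, lam, alpha = 0.9, 0.8, 0.05
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # target policy
mu = np.full((n_states, n_actions), 1.0 / n_actions)    # behavior policy

Q = np.zeros((n_states, n_actions))
e = np.zeros((n_states, n_actions))                     # eligibility traces
s = rng.integers(n_states)
a = rng.choice(n_actions, p=mu[s])
for t in range(200_000):
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=mu[s_next])
    # Expected bootstrap under the target policy -- no ratios anywhere:
    delta = R[s, a] + gamma * pi[s_next] @ Q[s_next] - Q[s, a]
    e *= gamma * lam * pi[s, a]                         # action-dependent trace decay
    e[s, a] += 1.0
    Q += alpha * delta * e
    s, a = s_next, a_next

print("learned action values:\n", Q)
```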

Some Simulation Results for Emphatic Temporal-Difference Learning Algorithms

no code implementations · 6 May 2016 · Huizhen Yu

This is a companion note to our recent study of the weak convergence properties of constrained emphatic temporal-difference learning (ETD) algorithms from a theoretical perspective.

Weak Convergence Properties of Constrained Emphatic Temporal-difference Learning with Constant and Slowly Diminishing Stepsize

no code implementations · 23 Nov 2015 · Huizhen Yu

In this paper we present convergence results for constrained versions of ETD($\lambda$) with constant stepsize and with diminishing stepsize from a broad range.

Emphatic Temporal-Difference Learning

no code implementations · 6 Jul 2015 · A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton

Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.
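The reweighting is carried out by a followon trace and an emphasis scalar that scale each update. Below is a minimal linear ETD(λ) sketch for off-policy evaluation in the standard update form; the chain, features, interest weights, policies, and stepsize are illustrative assumptions.

```python
import numpy as np

# Linear ETD(lambda) for off-policy policy evaluation. The followon
# trace F and emphasis M reweight updates; the chain, features,
# interest function, and policies are illustrative assumptions.
rng = np.random.default_rng(5)
n_states, n_actions, d = 6, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))
phi = rng.standard_normal((n_states, d))
gamma, lam, alpha = 0.9, 0.5, 0.001
interest = np.ones(n_states)                     # interest i(s); uniform here
pi = np.full((n_states, n_actions), 0.5)         # target policy
mu = np.array([[0.7, 0.3]] * n_states)           # behavior policy

theta = np.zeros(d)
e = np.zeros(d)                                  # eligibility trace
s = 0
F = interest[s]                                  # followon trace
for t in range(200_000):
    a = rng.choice(n_actions, p=mu[s])
    rho = pi[s, a] / mu[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    M = lam * interest[s] + (1.0 - lam) * F      # emphasis
    e = rho * (gamma * lam * e + M * phi[s])
    delta = R[s, a] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * delta * e
    F = rho * gamma * F + interest[s_next]       # followon trace for the next step
    s = s_next

print("value estimates:", phi @ theta)
```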

On Convergence of Emphatic Temporal-Difference Learning

no code implementations · 8 Jun 2015 · Huizhen Yu

We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces.
