no code implementations • 8 Dec 2023 • Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, Adam Wierman
Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, where the regularized Nash gap serves as a valid Lyapunov function.
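For intuition, a textbook instance of this structure, assuming entropy regularization with temperature $\tau$ (the paper's exact regularizer and timescale coupling may differ), is the softmax best-response dynamics

$$\sigma_i(x_{-i}) = \operatorname*{arg\,max}_{\mu \in \Delta(\mathcal{A}_i)} \big\{ \mu^\top u_i(x_{-i}) + \tau \mathcal{H}(\mu) \big\}, \qquad \dot{x}_i = \sigma_i(x_{-i}) - x_i,$$

where $\mathcal{H}$ denotes the entropy. The corresponding regularized Nash gap $\sum_i \big[ \max_{\mu} \{ \mu^\top u_i(x_{-i}) + \tau \mathcal{H}(\mu) \} - x_i^\top u_i(x_{-i}) - \tau \mathcal{H}(x_i) \big]$ is nonnegative and vanishes exactly at the regularized equilibrium, which is what makes it a natural Lyapunov candidate for these dynamics.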
no code implementations • 28 Mar 2023 • Zaiwei Chen, Siva Theja Maguluri, Martin Zubeldia
To demonstrate the applicability of our theoretical results, we use them to provide maximal concentration bounds for a large class of reinforcement learning algorithms, including but not limited to on-policy TD-learning with linear function approximation, off-policy TD-learning with generalized importance sampling factors, and $Q$-learning.
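For orientation, a maximal concentration bound controls the entire tail of the trajectory at once rather than a single iterate: for SA iterates $\{x_k\}$ with fixed point $x^*$, it takes the schematic form (the constants and rates are as derived in the paper)

$$\mathbb{P}\Big( \sup_{k \ge n} \|x_k - x^*\| \ge \epsilon \Big) \le \delta(n, \epsilon),$$

which is strictly stronger than a pointwise bound on $\mathbb{P}(\|x_n - x^*\| \ge \epsilon)$ for each fixed $n$.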
1 code implementation • 8 Mar 2023 • Zhaoyi Zhou, Zaiwei Chen, Yiheng Lin, Adam Wierman
The algorithm is scalable since each agent uses only local information and does not need access to the global state.
no code implementations • 30 Nov 2022 • Yizhou Zhang, Guannan Qu, Pan Xu, Yiheng Lin, Zaiwei Chen, Adam Wierman
In particular, we show that, despite restricting each agent's attention to only its $\kappa$-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in $\kappa$.
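In symbols, writing $N_i^\kappa$ for the set of agents within graph distance $\kappa$ of agent $i$, each learned policy has the localized form $\pi_i(a_i \mid s_{N_i^\kappa})$, and the guarantee is schematically

$$J(\pi^*) - J(\hat{\pi}^\kappa) \le C\, \kappa^{-c}$$

for constants $C, c > 0$ (our schematic notation; the exact exponent and constants are given in the paper).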
no code implementations • 5 Aug 2022 • Zaiwei Chen, Siva Theja Maguluri
Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.
no code implementations • 5 Mar 2022 • Zaiwei Chen, John Paul Clarke, Siva Theja Maguluri
$Q$-learning with function approximation is one of the most empirically successful yet theoretically least understood reinforcement learning (RL) algorithms; understanding its behavior was identified in Sutton (1999) as one of the most important open theoretical problems in the RL community.
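For reference, the generic form of $Q$-learning with linear function approximation discussed here is the update (a sketch in our notation; the paper's precise setting may add projections or other modifications)

$$\theta_{k+1} = \theta_k + \alpha_k \Big( r(s_k, a_k) + \gamma \max_{a'} \phi(s_{k+1}, a')^\top \theta_k - \phi(s_k, a_k)^\top \theta_k \Big)\, \phi(s_k, a_k),$$

where $\phi(s, a) \in \mathbb{R}^d$ is the feature vector. The $\max$ inside the bootstrapped target is what breaks the linearity exploited in TD-learning analyses and makes convergence so difficult to establish.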
no code implementations • 11 Nov 2021 • Zaiwei Chen, Shancong Mou, Siva Theja Maguluri
In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution in the limit as the constant stepsize goes to zero.
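Concretely, for a constant stepsize $\alpha$ the iterates $\{x_k^\alpha\}$ converge in distribution to a stationary random variable $x_\infty^\alpha$, and the object of study is the limit law of the centered, scaled iterate, e.g.

$$\frac{x_\infty^\alpha - x^*}{\sqrt{\alpha}} \quad \text{as } \alpha \to 0^+,$$

where $\sqrt{\alpha}$ is the classical diffusive scaling (whether this is the right scaling, and what the limit looks like, depends on the structure of the noise and is part of what the paper characterizes).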
no code implementations • NeurIPS 2021 • Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor.
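Stated in our own notation, the property is: there exist positive weights $w_i > 0$ and a common factor $\gamma_c \in (0, 1)$ such that, for every $p \in [1, \infty)$,

$$\|\mathcal{T}(x) - \mathcal{T}(y)\|_{w,p} \le \gamma_c\, \|x - y\|_{w,p}, \qquad \text{where } \|x\|_{w,p} := \Big( \sum_i w_i |x_i|^p \Big)^{1/p},$$

which, as $p \to \infty$, approaches a contraction with respect to an $\ell_\infty$-type norm, the norm in which Bellman operators are classically analyzed.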
no code implementations • 26 May 2021 • Zaiwei Chen, Sajad Khodadadian, Siva Theja Maguluri
In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation, and we establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, improving on all previously known convergence bounds for such algorithms.
no code implementations • 18 Feb 2021 • Sajad Khodadadian, Zaiwei Chen, Siva Theja Maguluri
In this paper, we provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on importance sampling.
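The importance-sampling correction underlying such off-policy variants reweights data collected under a behavior policy $\mu$ so that expectations match the target policy $\pi$; schematically (our notation),

$$\rho(s, a) = \frac{\pi(a \mid s)}{\mu(a \mid s)}, \qquad \mathbb{E}_{a \sim \mu(\cdot \mid s)}\big[ \rho(s, a)\, f(s, a) \big] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ f(s, a) \big],$$

which holds whenever $\mu(a \mid s) > 0$ wherever $\pi(a \mid s) > 0$.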
no code implementations • 2 Feb 2021 • Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., the efficiency of bootstrapping in RL.
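For concreteness, the two families interpolate between one-step bootstrapping and Monte Carlo through the standard $n$-step return and its geometric average, the $\lambda$-return:

$$G_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}), \qquad G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}.$$

Larger $n$ (or $\lambda$) reduces the bias introduced by the bootstrapped estimate $V$ at the price of higher variance from the longer reward sum, which is the trade-off these bounds make quantitative.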
no code implementations • NeurIPS 2020 • Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
In particular, we use it to establish the first-known convergence rate of the V-trace algorithm for off-policy TD-learning.
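For reference, the V-trace target of Espeholt et al. (2018), which this rate concerns, is built from truncated importance-sampling ratios:

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),$$

with $\rho_t = \min\big( \bar{\rho},\, \pi(a_t \mid x_t)/\mu(a_t \mid x_t) \big)$ and $c_i = \min\big( \bar{c},\, \pi(a_i \mid x_i)/\mu(a_i \mid x_i) \big)$; the truncation levels $\bar{\rho} \ge \bar{c}$ trade off the bias of the fixed point against the variance of the updates.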
1 code implementation • 27 May 2019 • Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, Siva Theja Maguluri
To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive finite-sample bounds for the popular $Q$-learning algorithm with linear function approximation, under a condition on the behavior policy.
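The Markovian SA template behind these results can be written generically as

$$x_{k+1} = x_k + \alpha_k \big( F(x_k, Y_k) - x_k \big),$$

where $\{Y_k\}$ is an underlying Markov chain rather than an i.i.d. sequence, and the averaged operator $\bar{F}(x) = \mathbb{E}_{Y \sim \nu}[F(x, Y)]$ (with $\nu$ the stationary distribution of $\{Y_k\}$) is assumed contractive; $Q$-learning with linear function approximation fits this template with $Y_k = (s_k, a_k, s_{k+1})$ and an appropriate choice of $F$ (a schematic form in our notation; see the paper for the exact assumptions).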