no code implementations • 27 Sep 2018 • YuJun Li, Chengzhuo Ni, Guangzeng Xie, Wenhao Yang, Shuchang Zhou, Zhihua Zhang
A2VI is more efficient than modified policy iteration, a classical approximate method for policy evaluation.
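As background for that comparison, modified policy iteration interleaves a greedy improvement step with m partial evaluation backups under the greedy policy. A minimal tabular sketch, where the toy MDP and hyperparameters are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, m=5, iters=200):
    """Tabular modified policy iteration: each sweep does a greedy
    improvement followed by m Bellman backups under the greedy policy.
    P: (S, A, S) transition tensor, R: (S, A) reward matrix."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V          # (S, A) action values
        pi = Q.argmax(axis=1)          # greedy improvement
        for _ in range(m):             # m partial evaluation backups
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
    return V, pi

# toy 2-state, 2-action MDP for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, pi = modified_policy_iteration(P, R)
```

Setting m=1 recovers value iteration and m=∞ recovers exact policy iteration, which is why the method is a natural baseline for approximate policy evaluation.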
no code implementations • NeurIPS 2019 • Xiang Li, Wenhao Yang, Zhihua Zhang
We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term.
2 code implementations • ICLR 2020 • Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang
In this paper, we analyze the convergence of FedAvg on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD iterations.
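A minimal sketch of the FedAvg pattern on non-iid strongly convex quadratics, where each round runs a few local gradient steps per client and then averages the local models; the clients, objectives, and step-size schedule below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each client k holds a strongly convex quadratic f_k(w) = 0.5 * ||w - c_k||^2
# with its own optimum c_k (non-iid); the global optimum is the mean of the c_k.
centers = rng.normal(size=(4, 3))          # 4 clients, 3-dim parameter

def local_sgd(w, c, steps, lr):
    """Run `steps` local gradient steps on one client's objective."""
    for _ in range(steps):
        w = w - lr * (w - c)               # grad of f_k at w is w - c_k
    return w

w = np.zeros(3)
for t in range(200):                       # communication rounds
    lr = 1.0 / (t + 10)                    # decaying step size
    local_models = [local_sgd(w.copy(), c, steps=5, lr=lr) for c in centers]
    w = np.mean(local_models, axis=0)      # server averages the local models
```

With a decaying step size the averaged model approaches the mean of the client optima; a fixed step size combined with multiple local steps would instead stall in a neighborhood determined by client drift, which is the heterogeneity effect the analysis has to control.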
no code implementations • 21 Oct 2019 • Xiang Li, Wenhao Yang, Shusen Wang, Zhihua Zhang
The technique of local updates has recently become a powerful tool in centralized settings for improving communication efficiency via periodic communication.
no code implementations • 31 Oct 2020 • Wenhao Yang, Xiang Li, Guangzeng Xie, Zhihua Zhang
Regularized MDPs serve as smoothed versions of the original MDPs.
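One concrete sense in which regularization smooths an MDP: entropy regularization replaces the hard max in the Bellman optimality operator with a log-sum-exp soft maximum, which is differentiable in the Q-values. A minimal sketch, with an illustrative toy MDP and temperature tau:

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.9, tau=0.1, iters=500):
    """Value iteration for an entropy-regularized MDP: the max over
    actions becomes tau * logsumexp(Q / tau), a smooth soft maximum."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V                   # (S, A) action values
        m = Q.max(axis=1, keepdims=True)        # shift for numerical stability
        # soft max over actions; tau -> 0 recovers the ordinary Bellman operator
        V = (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).ravel()
    return V

# toy 2-state, 2-action MDP for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V_soft = soft_value_iteration(P, R, tau=0.1)
V_hard = soft_value_iteration(P, R, tau=1e-8)   # effectively unregularized
```

The entropy bonus makes the smoothed value at most tau * log|A| / (1 - gamma) above the unregularized one, so the regularized MDP stays uniformly close to the original as tau shrinks.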
no code implementations • 9 May 2021 • Wenhao Yang, Liangyu Zhang, Zhihua Zhang
In this paper, we study the non-asymptotic and asymptotic performance of the optimal robust policy and value function of robust Markov Decision Processes (MDPs), where the optimal robust policy and value function are estimated solely from a generative model.
1 code implementation • 29 Dec 2021 • Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, Michael I. Jordan
We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings.
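A minimal sketch of that setting: in synchronous tabular Q-learning, every state-action pair is updated at each step from a fresh sample of the next state, and the Polyak-Ruppert estimate is the running average of the iterates. The toy MDP and step-size schedule here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy 2-state, 2-action MDP for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
S, A, gamma = 2, 2, 0.9

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))                       # Polyak-Ruppert average
T = 20000
for t in range(1, T + 1):
    eta = 1.0 / (1 + (1 - gamma) * t) ** 0.75  # polynomial step size
    for s in range(S):                         # synchronous: update every (s, a)
        for a in range(A):
            s_next = rng.choice(S, p=P[s, a])  # fresh sample from the model
            target = R[s, a] + gamma * Q[s_next].max()
            Q[s, a] += eta * (target - Q[s, a])
    Q_bar += (Q - Q_bar) / t                   # running average of the iterates
```

The averaged estimate Q_bar, rather than the last iterate Q, is the object of interest: averaging suppresses the step-size-driven fluctuations of the iterates around the optimal action-value function.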
1 code implementation • 6 Apr 2022 • Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, Zhihua Zhang
We study a Federated Reinforcement Learning (FedRL) problem in which $n$ agents collaboratively learn a single policy without sharing the trajectories they collected during agent-environment interaction.
no code implementations • 18 May 2022 • Xiaobo Xia, Wenhao Yang, Jie Ren, Yewen Li, Yibing Zhan, Bo Han, Tongliang Liu
Second, the constraints for diversity are designed to be task-agnostic, which prevents them from working well.
no code implementations • 27 May 2022 • Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári
In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
no code implementations • 12 Sep 2022 • Miao Lu, Wenhao Yang, Liangyu Zhang, Zhihua Zhang
Specifically, we propose a two-stage estimator based on the instrumental variables and establish its statistical properties in the confounded MDPs with a linear structure.
no code implementations • 2 Feb 2023 • Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M. Jordan, Zhihua Zhang
Moreover, we prove that the alternative form still plays a role similar to that of the original form.
1 code implementation • 29 Apr 2023 • Liangyu Zhang, Yang Peng, Wenhao Yang, Zhihua Zhang
To the best of our knowledge, we are the first to apply tools from semi-infinite programming (SIP) to solve constrained reinforcement learning problems.
no code implementations • 19 May 2023 • Yibo Wang, Wenhao Yang, Wei Jiang, Shiyin Lu, Bing Wang, Haihong Tang, Yuanyu Wan, Lijun Zhang
Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named $\text{BOGD}_\text{IP}$, and establish an $\mathcal{O}(T^{3/4}(1+P_T))$ dynamic regret bound, where $P_T$ denotes the path-length of the comparator sequence.
1 code implementation • 22 May 2023 • Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai, Yutaka Matsuo
Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms.
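A minimal sketch of the KL-regularized core of MDVI on a toy tabular MDP, using the identity that when the KL penalty is taken against the previous policy, the policy becomes a softmax over the running sum of past Q estimates and the regularized greedy value is a difference of log-sum-exps over that sum; the MDP and the inverse temperature beta are illustrative assumptions:

```python
import numpy as np

def logsumexp(x):
    """Row-wise log-sum-exp with the usual max shift for stability."""
    m = x.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True)))[:, 0]

def mdvi(P, R, gamma=0.9, beta=10.0, iters=300):
    """KL-regularized value iteration in the spirit of MDVI: the policy
    update pi_{k+1} proportional to pi_k * exp(beta * Q_k) telescopes into a
    softmax over the running sum of past Q estimates."""
    S, A = R.shape
    Q = np.zeros((S, A))
    q_sum = np.zeros((S, A))          # running sum of past Q estimates
    for _ in range(iters):
        new_sum = q_sum + Q
        # KL-regularized greedy value: soft improvement over the running sum
        V = (logsumexp(beta * new_sum) - logsumexp(beta * q_sum)) / beta
        q_sum = new_sum
        Q = R + gamma * P @ V
    return Q

# toy 2-state, 2-action MDP for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
Q = mdvi(P, R)
```

Because the KL penalty is measured against the previous policy rather than a fixed reference, it vanishes as the iterates stabilize, so Q approaches the unregularized optimal action-value function.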
1 code implementation • 29 Sep 2023 • Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, Zhihua Zhang
This implies that the distributional policy evaluation problem can be solved in a sample-efficient manner.