no code implementations • 11 Feb 2025 • Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, Shangtong Zhang
Reinforcement learning (RL) agents typically optimize their policies by performing expensive backward passes to update their network parameters.
no code implementations • 31 Jan 2025 • Xinyu Liu, Zixuan Xie, Shangtong Zhang
As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
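For intuition about the algorithm being analyzed, here is a minimal tabular $Q$-learning loop with an $\epsilon$-softmax behavior policy. This is a sketch only: the toy ChainEnv, the temperature tau, and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

class ChainEnv:
    """Toy 5-state chain: action 1 moves right (reward 1 on reaching the
    last state), action 0 resets to the start. Purely illustrative."""
    def __init__(self, n_states=5):
        self.n = n_states
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = self.s + 1 if a == 1 else 0
        done = self.s == self.n - 1
        return self.s, float(done), done

def eps_softmax(q_row, eps=0.1, tau=1.0):
    # epsilon-softmax: a softmax over Q-values mixed with a uniform distribution
    z = q_row / tau - np.max(q_row / tau)   # shift for numerical stability
    soft = np.exp(z) / np.exp(z).sum()
    return (1 - eps) * soft + eps / len(q_row)

def q_learning(env, n_states, n_actions=2, alpha=0.1, gamma=0.99, steps=20_000):
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        a = rng.choice(n_actions, p=eps_softmax(Q[s]))       # behavior policy
        s2, r, done = env.step(a)
        target = r + (0.0 if done else gamma * Q[s2].max())  # Bellman optimality backup
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s2
    return Q

Q = q_learning(ChainEnv(), n_states=5)
```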
no code implementations • 26 Nov 2024 • Amar Kulkarni, Shangtong Zhang, Madhur Behl
First, CRASH can control adversarial non-player character (NPC) agents in an AV simulator to automatically induce collisions with the Ego vehicle, falsifying its motion planner.
no code implementations • 20 Nov 2024 • Xiaochi Qian, Zixuan Xie, Xinyu Liu, Shangtong Zhang
As applications, we provide the first almost sure convergence rate for $Q$-learning with Markovian samples without count-based learning rates.
no code implementations • 8 Oct 2024 • Claire Chen, Shuze Liu, Shangtong Zhang
In reinforcement learning, classic on-policy evaluation methods often suffer from high variance and require massive online data to attain the desired accuracy.
no code implementations • 3 Oct 2024 • Shuze Liu, Claire Chen, Shangtong Zhang
Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate.
no code implementations • 29 Sep 2024 • Ethan Blaser, Shangtong Zhang
Stochastic approximation is an important class of algorithms, and a large body of previous analysis focuses on stochastic approximations driven by contractive operators, an assumption that does not hold in some important reinforcement learning settings.
no code implementations • 18 Sep 2024 • Jiuqi Wang, Shangtong Zhang
This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features.
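To make the object of study concrete, here is a minimal sketch of linear TD(0) with deliberately redundant (hence linearly dependent) features; the random-walk MRP, the feature map, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_td0(phi, d, sample_transition, alpha=0.05, gamma=0.99, steps=20_000):
    """On-policy linear TD(0): learn w so that v(s) is approximated by phi(s) @ w.
    The update is well defined even when the features are linearly dependent."""
    w = np.zeros(d)
    s = 0
    for _ in range(steps):
        s2, r, done = sample_transition(s)
        v_next = 0.0 if done else phi(s2) @ w
        delta = r + gamma * v_next - phi(s) @ w   # TD error
        w += alpha * delta * phi(s)               # semi-gradient update
        s = 0 if done else s2                     # restart episodes at state 0
    return w

# Toy 5-state random walk; reward 1 on exiting to the right, 0 to the left.
def walk(s):
    s2 = s + rng.choice([-1, 1])
    if s2 < 0:
        return 0, 0.0, True
    if s2 > 4:
        return 0, 1.0, True
    return s2, 0.0, False

# One-hot features plus a constant bias: the 6 features are linearly dependent.
phi = lambda s: np.array([float(s == i) for i in range(5)] + [1.0])
w = linear_td0(phi, d=6, sample_transition=walk)
```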
no code implementations • 16 Aug 2024 • Shuze Daniel Liu, Claire Chen, Shangtong Zhang
To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately.
no code implementations • 22 May 2024 • Jiuqi Wang, Ethan Blaser, Hadi Daneshmand, Shangtong Zhang
During inference, the model can then output a label for the query instance according to the context.
no code implementations • 15 Jan 2024 • Shuze Liu, Shuhang Chen, Shangtong Zhang
Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning.
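As an illustration of this template (a sketch, not the paper's algorithm), the following shows the generic iterate $x_{t+1} = x_t + \alpha_t H(x_t, Y_t)$ with SGD as one instance; the quadratic example and step-size schedule are assumptions for demonstration.

```python
import numpy as np

def stochastic_approximation(x0, H, step_sizes, samples):
    """Generic stochastic approximation: x_{t+1} = x_t + alpha_t * H(x_t, Y_t).

    SGD is the instance H(x, y) = -grad_x loss(x, y); temporal difference
    learning is the instance H(w, transition) = td_error * features.
    """
    x = np.asarray(x0, dtype=float)
    for alpha, y in zip(step_sizes, samples):
        x = x + alpha * H(x, y)
    return x

# Instance: SGD on E[0.5 * (x - Y)^2] with noisy targets Y ~ N(3, 1);
# the iterates converge to the mean, x* = 3.
rng = np.random.default_rng(0)
T = 5000
x_hat = stochastic_approximation(
    x0=[0.0],
    H=lambda x, y: -(x - y),                       # negative gradient
    step_sizes=(1.0 / (t + 1) for t in range(T)),  # Robbins-Monro step sizes
    samples=(rng.normal(3.0) for _ in range(T)),
)
```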
1 code implementation • 7 Aug 2023 • Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals
StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic, and multi-agent, and mastering it requires strategic planning over long time horizons with real-time low-level execution.
no code implementations • 2 Aug 2023 • Xiaochi Qian, Shangtong Zhang
In this paper, we revisit this $A^\top$TD and prove that a variant of $A^\top$TD, called $A_t^\top$TD, is also an effective solution to the deadly triad.
1 code implementation • 31 Jan 2023 • Shuze Liu, Shangtong Zhang
Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome.
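A minimal sketch of such an online Monte Carlo estimator, assuming a hypothetical `run_episode(policy)` interface that returns one rollout's rewards:

```python
import numpy as np

def mc_evaluate(run_episode, policy, n_episodes=1000, gamma=0.99):
    """Online Monte Carlo evaluation: execute the policy repeatedly in the
    environment and average the observed discounted returns."""
    returns = []
    for _ in range(n_episodes):
        rewards = run_episode(policy)                     # one online rollout
        g = sum(r * gamma**t for t, r in enumerate(rewards))
        returns.append(g)
    # report the estimate and its standard error
    return np.mean(returns), np.std(returns) / np.sqrt(n_episodes)
```

The standard error shrinks only as $1/\sqrt{n}$, which is why high-variance returns translate into massive online data requirements.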
no code implementations • 14 Feb 2022 • Shangtong Zhang, Remi Tachet, Romain Laroche
SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region.
1 code implementation • NeurIPS 2023 • Shangtong Zhang, Remi Tachet, Romain Laroche
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
1 code implementation • 11 Aug 2021 • Shangtong Zhang, Shimon Whiteson
Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems.
no code implementations • 12 Jul 2021 • Ray Jiang, Shangtong Zhang, Veronica Chelu, Adam White, Hado van Hasselt
We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting.
1 code implementation • 21 Jan 2021 • Shangtong Zhang, Hengshuai Yao, Shimon Whiteson
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
1 code implementation • 8 Jan 2021 • Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.
1 code implementation • 2 Oct 2020 • Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes
In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
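For context, a minimal sketch in standard policy-gradient notation (not the paper's exact statement): the gradient of the discounted objective carries a $\gamma^t$ factor, $\nabla_\theta J(\theta) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^t \, q_{\pi_\theta}(S_t, A_t) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\big]$, whereas the common actor update drops $\gamma^t$ and uses $q_{\pi_\theta}(S_t, A_t) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ directly; it is this omission that the paper interprets from an auxiliary task perspective.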
1 code implementation • NeurIPS 2020 • Shangtong Zhang, Vivek Veeriah, Shimon Whiteson
We present a Reverse Reinforcement Learning (Reverse RL) approach for representing retrospective knowledge.
1 code implementation • 22 Apr 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite-horizon MDP, optimizing the variance of a per-step reward random variable.
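One standard way to write such a risk-averse objective (a sketch, not necessarily the paper's exact formulation) is $\max_\pi \; \mathbb{E}_\pi[R] - \lambda \, \mathrm{Var}_\pi(R)$, with $\mathrm{Var}_\pi(R) = \mathbb{E}_\pi[R^2] - (\mathbb{E}_\pi[R])^2$, where $R$ is the per-step reward under the distribution induced by $\pi$ and $\lambda > 0$ trades expected reward against risk.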
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
Namely, the optimization problem in GenDICE is no longer a convex-concave saddle-point problem once nonlinearity is introduced in the parameterization of the optimization variable to ensure positivity, so primal-dual algorithms are not guaranteed to converge or to find the desired solution.
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.
no code implementations • 13 May 2019 • Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yao-Liang Yu
In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties.
1 code implementation • 12 May 2019 • Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, Shangtong Zhang, Andrzej Wojcicki, Mai Xu
Intrinsic rewards were introduced to simulate how human intelligence works; they are usually evaluated by intrinsically-motivated play, i.e., playing games without extrinsic rewards but evaluated with extrinsic rewards.
1 code implementation • 3 May 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We revisit residual algorithms in both model-free and model-based reinforcement learning settings.
2 code implementations • NeurIPS 2019 • Shangtong Zhang, Shimon Whiteson
We reformulate the option framework as two parallel augmented MDPs.
1 code implementation • NeurIPS 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting.
1 code implementation • 6 Nov 2018 • Shangtong Zhang, Hao Chen, Hengshuai Yao
In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning.
3 code implementations • 5 Nov 2018 • Shangtong Zhang, Borislav Mavrin, Linglong Kong, Bo Liu, Hengshuai Yao
In this paper, we propose the Quantile Option Architecture (QUOTA) for exploration based on recent advances in distributional reinforcement learning (RL).
1 code implementation • Journal of Open Source Software 2018 • Ryan R. Curtin, Marcus Edel, Mikhail Lozhnikov, Yannis Mentekidis, Sumedh Ghaisas, Shangtong Zhang
In the past several years, the field of machine learning has seen an explosion of interest and excitement, with hundreds or thousands of algorithms developed for different tasks every year.
4 code implementations • 4 Dec 2017 • Shangtong Zhang, Richard S. Sutton
Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink the utility of experience replay.
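As a reminder of the mechanism in question, here is a minimal uniform replay buffer (an illustrative sketch, not the paper's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal uniform experience replay buffer."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of online data
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```

Uniform sampling from the buffer decorrelates consecutive transitions, which is the usual argument for replay's utility.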
no code implementations • 30 Nov 2017 • Shangtong Zhang, Osmar R. Zaiane
Reinforcement learning and evolution strategies are two major approaches to addressing complicated control problems.
no code implementations • 9 Dec 2016 • Vivek Veeriah, Shangtong Zhang, Richard S. Sutton
In this paper, we introduce a new incremental learning algorithm called crossprop, which learns the incoming weights of hidden units based on the meta-gradient descent approach previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.