Online Reinforcement Learning in Stochastic Games

NeurIPS 2017  ·  Chen-Yu Wei, Yi-Te Hong, Chi-Jen Lu ·

We study online reinforcement learning in average-reward stochastic games (SGs). An SG models a two-player zero-sum game in a Markov environment, where state transitions and one-step payoffs are determined simultaneously by a learner and an adversary. We propose the UCSG algorithm that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound has a dependency on the diameter, which is an intrinsic value related to the mixing property of SGs. If we let the opponent play an optimistic best response to the learner, UCSG finds an $\varepsilon$-maximin stationary policy with a sample complexity of $\tilde{\mathcal{O}}\left(\text{poly}(1/\varepsilon)\right)$, where $\varepsilon$ is the gap to the best policy.

PDF Abstract NeurIPS 2017 PDF NeurIPS 2017 Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here