Sample-efficient actor-critic algorithms with an etiquette for zero-sum Markov games

29 Sep 2021 · Ahmet Alacaoglu, Luca Viano, Niao He, Volkan Cevher

We introduce algorithms based on natural policy gradient and two time-scale natural actor-critic, and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to a Nash equilibrium in the multi-agent setting. Our analysis uses the error propagation scheme from approximate dynamic programming, recent advances on the global convergence of policy gradient methods, temporal difference learning, and techniques from the stochastic primal-dual optimization literature. Our algorithms feature two stages, requiring the agents to agree on an etiquette before starting their interactions, which is feasible, for instance, in self-play. On the other hand, the agents only have access to the joint reward and the joint next state, and not to each other's actions or policies. Our sample complexities also match the best-known results for the global convergence of policy gradient and two time-scale actor-critic algorithms in the single-agent setting. We provide numerical verification of our method on a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance compared to the recently proposed optimistic gradient descent-ascent variant for Markov games.
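As a concrete illustration of this information structure (each agent observes only the joint reward, never the opponent's action or policy), the sketch below runs independent natural-policy-gradient updates, which under a softmax parameterization reduce to multiplicative-weights steps on bandit value estimates, on a two-player zero-sum matrix game. This is a simplified stand-in and not the paper's exact algorithm; the payoff matrix, step size, noise level, and iteration count are illustrative choices, not values from the paper.

import numpy as np

# Illustrative sketch (not the authors' exact method): independent natural
# policy gradient / multiplicative-weights updates for a two-player zero-sum
# matrix game. Each player samples its own action, observes only the noisy
# joint reward, and updates its softmax policy from an importance-weighted
# value estimate. All constants below are made up for illustration.

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])        # rock-paper-scissors payoff for player 1
n, m = A.shape
theta1, theta2 = np.zeros(n), np.zeros(m)   # softmax logits of each player
avg1, avg2 = np.zeros(n), np.zeros(m)       # running averages of the policies
eta, T = 0.05, 20000                        # step size and number of rounds

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

for t in range(T):
    pi1, pi2 = softmax(theta1), softmax(theta2)
    a1 = rng.choice(n, p=pi1)
    a2 = rng.choice(m, p=pi2)
    r = A[a1, a2] + 0.1 * rng.standard_normal()   # noisy joint reward only

    # Importance-weighted (bandit) estimates of each player's action values;
    # player 2 minimizes, so its reward is -r.
    q1 = np.zeros(n); q1[a1] = r / pi1[a1]
    q2 = np.zeros(m); q2[a2] = -r / pi2[a2]

    # With a softmax parameterization, the natural policy gradient step is a
    # multiplicative-weights update on the logits.
    theta1 += eta * q1
    theta2 += eta * q2

    avg1 += pi1
    avg2 += pi2

avg1 /= T
avg2 /= T
print("avg policy of player 1:", np.round(avg1, 3))
print("avg policy of player 2:", np.round(avg2, 3))
print("exploitability:", A.dot(avg2).max() - A.T.dot(avg1).min())

The averaged policies are reported because, for matrix games, it is the averaged (rather than last) iterates of such no-regret updates that approach a Nash equilibrium; the printed exploitability gap measures how far the averaged policy pair is from equilibrium.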
