Sample-efficient actor-critic algorithms with an etiquette for zero-sum Markov games

29 Sep 2021 · Ahmet Alacaoglu, Luca Viano, Niao He, Volkan Cevher

We introduce algorithms based on natural policy gradient and two time-scale natural actor-critic, and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to a Nash equilibrium in the multi-agent setting. Our analysis uses the error propagation scheme from approximate dynamic programming, recent advances on the global convergence of policy gradient methods, temporal difference learning, and techniques from the stochastic primal-dual optimization literature. Our algorithms feature two stages, requiring the agents to agree on an etiquette before starting their interactions, which is feasible, for instance, in self-play. On the other hand, the agents only have access to the joint reward and the joint next state, and not to each other's actions or policies. Our sample complexities also match the best-known results for the global convergence of policy gradient and two time-scale actor-critic algorithms in the single-agent setting. We provide numerical verification of our method on a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance compared to the recently proposed optimistic gradient descent-ascent variant for Markov games.
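As a concrete illustration of this information structure (each agent observes only the joint reward, never the opponent's action or policy), the sketch below runs independent natural-policy-gradient updates, which under a softmax parameterization reduce to multiplicative-weights steps on bandit value estimates, on a two-player zero-sum matrix game. This is a simplified stand-in and not the paper's exact algorithm; the payoff matrix, step size, noise level, and iteration count are illustrative choices, not values from the paper.

import numpy as np

# Illustrative sketch (not the authors' exact method): independent natural
# policy gradient / multiplicative-weights updates for a two-player zero-sum
# matrix game. Each player samples its own action, observes only the noisy
# joint reward, and updates its softmax policy from an importance-weighted
# value estimate. All constants below are made up for illustration.

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])        # rock-paper-scissors payoff for player 1
n, m = A.shape
theta1, theta2 = np.zeros(n), np.zeros(m)   # softmax logits of each player
avg1, avg2 = np.zeros(n), np.zeros(m)       # running averages of the policies
eta, T = 0.05, 20000                        # step size and number of rounds

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

for t in range(T):
    pi1, pi2 = softmax(theta1), softmax(theta2)
    a1 = rng.choice(n, p=pi1)
    a2 = rng.choice(m, p=pi2)
    r = A[a1, a2] + 0.1 * rng.standard_normal()   # noisy joint reward only

    # Importance-weighted (bandit) estimates of each player's action values;
    # player 2 minimizes, so its reward is -r.
    q1 = np.zeros(n); q1[a1] = r / pi1[a1]
    q2 = np.zeros(m); q2[a2] = -r / pi2[a2]

    # With a softmax parameterization, the natural policy gradient step is a
    # multiplicative-weights update on the logits.
    theta1 += eta * q1
    theta2 += eta * q2

    avg1 += pi1
    avg2 += pi2

avg1 /= T
avg2 /= T
print("avg policy of player 1:", np.round(avg1, 3))
print("avg policy of player 2:", np.round(avg2, 3))
print("exploitability:", A.dot(avg2).max() - A.T.dot(avg1).min())

The averaged policies are reported because, for matrix games, it is the averaged (rather than last) iterates of such no-regret updates that approach a Nash equilibrium; the printed exploitability gap measures how far the averaged policy pair is from equilibrium.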
