Learning to Safely Exploit a Non-Stationary Opponent

In dynamic multi-player games, an effective way to exploit an opponent's weaknesses is to build an accurate opponent model: with a perfect model, the learning problem reduces to a single-agent optimization that can be solved with standard reinforcement learning. However, naive behavior cloning may not suffice to train an exploiting policy, because opponents' behaviors are often non-stationary as they adapt in response to other agents' strategies. Conversely, overfitting to an opponent (i.e., exploiting only one specific type of opponent) leaves the learning player easily exploitable by others. To address these problems, we propose Exploit Policy-Space Opponent Model (EPSOM). EPSOM models an opponent's non-stationarity as a sequence of transitions among distinct policies and formulates this transition process with non-parametric Bayesian methods. To balance the trade-off between exploitation and exploitability, we train the player to learn a robust best response to the opponent's predicted strategy by solving a modified meta-game in policy space. We consider a two-player zero-sum game setting and evaluate EPSOM on Kuhn poker; the results suggest that our method exploits its adaptive opponent while maintaining low exploitability (i.e., it achieves safe opponent exploitation). Furthermore, the EPSOM agent performs strongly against unknown non-stationary opponents without further training.
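The abstract does not give implementation details, but its two main ingredients can be sketched at the meta-game level. The snippet below is a minimal, illustrative sketch, not the paper's implementation: Dirichlet-smoothed transition counts stand in for the non-parametric Bayesian policy-transition model, and a convex combination of expected and worst-case payoff stands in for the robust best response in the modified meta-game. All function names, the parameters `alpha` and `w`, and the toy payoff matrix are assumptions introduced here for illustration.

```python
import numpy as np


def predict_opponent_mixture(transition_counts, last_policy, alpha=1.0):
    """Predict a distribution over the opponent's next meta-policy.

    Dirichlet-smoothed estimate of policy-transition probabilities; a
    simplified stand-in for the paper's non-parametric Bayesian model.
    """
    counts = transition_counts[last_policy] + alpha
    return counts / counts.sum()


def robust_best_response(payoff, opponent_mixture, w=0.5):
    """Pick a row meta-policy trading off exploitation and exploitability.

    payoff[i, j]: row player's expected payoff when meta-policy i plays
    against opponent meta-policy j (e.g., estimated from simulations).
    Each candidate is scored by a convex combination of its expected
    payoff against the predicted opponent mixture (exploitation) and its
    worst-case payoff over all opponent meta-policies (safety).
    """
    exploit = payoff @ opponent_mixture   # expected payoff vs. prediction
    safety = payoff.min(axis=1)           # worst-case payoff per policy
    return int(np.argmax(w * exploit + (1.0 - w) * safety))


# Toy example: 3 meta-policies per player, random payoff matrix.
rng = np.random.default_rng(0)
payoff = rng.uniform(-1.0, 1.0, size=(3, 3))
transition_counts = np.array([[4.0, 1.0, 0.0],
                              [1.0, 3.0, 2.0],
                              [0.0, 2.0, 5.0]])

mixture = predict_opponent_mixture(transition_counts, last_policy=2)
choice = robust_best_response(payoff, mixture, w=0.7)
```

Setting `w` closer to 1 weights pure exploitation of the predicted opponent, while values near 0 recover a conservative, low-exploitability choice; this mirrors, in highly simplified form, the exploitation/exploitability trade-off the abstract describes.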
