Dota 2 is a multiplayer online battle arena (MOBA). The task is to train one-or-more agents to play and win the game.
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
Even though death events are rare within a game (1\% of the data), the model achieves 0. 377 precision with 0. 725 recall on test data when prompted to predict which of any of the 10 players of either team will die within 5 seconds.