With a general feedback graph, the observation of an arm may not be available when that arm is pulled, which makes exploration more expensive and makes it harder for an algorithm to perform optimally in both environments.
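To make the observation structure concrete, here is a minimal sketch of an exponential-weights learner under a feedback graph without self-loops, so that pulling an arm reveals only its out-neighbors' losses. The graph `G`, the Bernoulli losses, and all constants below are illustrative assumptions, not a setting analyzed in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 4, 5000, 0.05
# G[i, j] = 1: pulling arm i reveals the loss of arm j.  No self-loops,
# so an arm's own loss is never observed when it is pulled.
G = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])
mean_loss = np.array([0.2, 0.5, 0.6, 0.7])
cum_est = np.zeros(K)   # cumulative importance-weighted loss estimates

for _ in range(T):
    w = np.exp(-eta * cum_est)
    p = w / w.sum()                       # exponential-weights play distribution
    arm = rng.choice(K, p=p)
    losses = rng.binomial(1, mean_loss)   # this round's Bernoulli losses
    obs_prob = G.T @ p                    # P(arm j's loss is observed)
    for j in range(K):
        if G[arm, j]:
            cum_est[j] += losses[j] / obs_prob[j]   # importance weighting

print("final play distribution:", np.round(p, 3))
```

Note that the importance weights divide by the probability that an arm is observed rather than the probability that it is pulled, which is exactly where graph feedback departs from the standard bandit update.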
In contextual bandits, a major challenge is to develop theoretically sound and empirically efficient algorithms for general function classes.
However, it is in general unknown how to derive efficient and effective exploration-exploitation (EE) trade-off methods for non-linear, complex tasks, such as contextual bandits with a deep neural network as the reward function.
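For illustration, the following sketch pairs a small per-arm neural reward model with a naive epsilon-greedy rule. Everything here (the network shape, the synthetic reward, all constants) is an assumption made for the example, and the crude epsilon-greedy choice stands in for the principled EE trade-off that remains open in this setting.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, H, T, lr, eps = 8, 3, 16, 3000, 0.05, 0.1

# One tiny one-hidden-layer network per arm: r_hat(x) = w2 . tanh(W1 x).
W1 = [rng.normal(0, 0.3, (H, d)) for _ in range(K)]
w2 = [rng.normal(0, 0.3, H) for _ in range(K)]
theta = rng.normal(0, 1, (K, d))    # hidden true reward parameters

def predict(a, x):
    h = np.tanh(W1[a] @ x)
    return w2[a] @ h, h

for _ in range(T):
    x = rng.normal(0, 1, d)                           # fresh context
    preds = [predict(a, x)[0] for a in range(K)]
    # Naive epsilon-greedy EE rule: explore uniformly with probability eps.
    a = rng.integers(K) if rng.random() < eps else int(np.argmax(preds))
    r = np.tanh(theta[a] @ x) + rng.normal(0, 0.1)    # noisy reward
    # One SGD step on squared error for the pulled arm's network.
    y, h = predict(a, x)
    g = y - r
    grad_w2 = g * h
    grad_W1 = g * np.outer(w2[a] * (1 - h ** 2), x)
    w2[a] -= lr * grad_w2
    W1[a] -= lr * grad_W1
```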
Posterior sampling for reinforcement learning (PSRL) is a useful framework for making decisions in an unknown environment.
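A minimal tabular sketch of the PSRL loop, under assumed conjugate priors (Dirichlet over transitions, Gaussian over mean rewards) on a toy randomly generated MDP: each episode, draw one plausible environment from the posterior, solve it, and act greedily in the sampled model. All priors and constants are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, horizon, episodes, gamma = 5, 2, 20, 200, 0.95

# Hidden true environment, used only to simulate interaction.
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R_true = rng.uniform(0, 1, (S, A))

# Conjugate posteriors: Dirichlet for transitions, Gaussian for mean rewards.
trans_counts = np.ones((S, A, S))     # Dirichlet(1, ..., 1) prior
rew_sum = np.zeros((S, A))
rew_n = np.zeros((S, A))

def solve(P, R, iters=100):
    """Value iteration in a sampled MDP; returns Q-values."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

for _ in range(episodes):
    # PSRL step: sample one MDP from the posterior, act greedily in it.
    P_s = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                    for s in range(S)])
    post_mean = (0.5 + rew_sum) / (1.0 + rew_n)   # N(0.5, 1) prior,
    post_var = 1.0 / (1.0 + rew_n)                # unit-variance noise
    R_s = rng.normal(post_mean, np.sqrt(post_var))
    Q = solve(P_s, R_s)
    s = 0
    for _ in range(horizon):
        a = int(np.argmax(Q[s]))
        r = R_true[s, a] + rng.normal(0, 1.0)
        s_next = rng.choice(S, p=P_true[s, a])
        trans_counts[s, a, s_next] += 1           # posterior update from data
        rew_sum[s, a] += r
        rew_n[s, a] += 1
        s = s_next
```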
In this paper, we present Lazy-CFR, a counterfactual regret minimization (CFR) algorithm that adopts a lazy update strategy to avoid traversing the whole game tree in each round.
We provide insights into the relationship between $A^*$ sampling and probability matching by analyzing a nontrivial special case in which the state space is partitioned into two subsets.
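The connection can already be glimpsed in the Gumbel-max trick that underlies $A^*$ sampling: perturbing log-probabilities with Gumbel noise and taking the argmax draws an exact sample, so under a two-way partition the subset containing the perturbed argmax is selected with exactly its probability mass, which is probability matching over the partition. The short Monte Carlo check below uses arbitrary assumed logits and is only an illustrative observation, not the analysis referred to above.

```python
import numpy as np

rng = np.random.default_rng(3)
logits = np.array([1.0, 0.3, -0.5, 0.2, 2.0, -1.0])   # unnormalized log-probs
p = np.exp(logits) / np.exp(logits).sum()
p_left = p[:3].sum()        # mass of the first subset of the partition

# Gumbel-max trick: argmax(logits + Gumbel noise) is an exact sample from p,
# so the subset containing the perturbed argmax wins with its exact mass.
n, hits = 200_000, 0
for _ in range(n):
    g = rng.gumbel(size=logits.size)
    hits += (logits + g).argmax() < 3
print(p_left, hits / n)     # the two numbers agree up to Monte Carlo error
```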
In this paper, we present a novel technique, lazy update, which avoids traversing the whole game tree in CFR, together with a novel analysis of the regret of CFR with lazy updates.
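At the core of every CFR variant is the regret-matching update at each information set. The sketch below (a toy rock-paper-scissors matrix against a fixed opponent, with all values assumed for illustration) shows the per-round update that vanilla CFR performs at every infoset and that a lazy update strategy would defer:

```python
import numpy as np

def regret_matching(cum_regret):
    """Strategy proportional to positive cumulative regrets (one infoset)."""
    pos = np.maximum(cum_regret, 0.0)
    if pos.sum() > 0:
        return pos / pos.sum()
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

payoff = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])      # rock-paper-scissors, row player's payoff
opp = np.array([0.5, 0.3, 0.2])     # fixed opponent mixture
cum_regret = np.zeros(3)
avg = np.zeros(3)
for _ in range(10_000):
    sigma = regret_matching(cum_regret)
    ev = payoff @ opp               # value of each action vs the opponent
    u = sigma @ ev                  # value of the current mixed strategy
    # Vanilla CFR performs this regret update at every infoset in every
    # round; a lazy update strategy defers it to avoid full tree traversals.
    cum_regret += ev - u
    avg += sigma
print(avg / avg.sum())              # concentrates on the best response (paper)
```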
Our evaluation results on various online scenarios show that BiLA can effectively infer the true labels, with an error-rate reduction of at least 10 and 1.5 percentage points on synthetic and real-world datasets, respectively.
We study the problem of learning the pure Nash equilibrium of a two-player zero-sum static game with random payoffs under unknown distributions via efficient payoff queries.
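As a point of reference, here is a naive non-adaptive baseline, in which the payoff matrix, noise level, and query budget are all assumptions for the example: query every entry the same number of times, then search the empirical matrix for a pure saddle point, i.e., an entry that is a column maximum for the row player and a row minimum for the column player. Query-efficient algorithms would instead allocate queries adaptively.

```python
import numpy as np

rng = np.random.default_rng(5)
M = np.array([[0.6, 0.5, 0.8],
              [0.9, 0.4, 0.7],
              [0.5, 0.3, 0.6]])     # hidden mean payoffs (row player maximizes)

# Non-adaptive baseline: query every entry n times with noisy feedback,
# then estimate each mean payoff from the observed samples.
n = 2000
est = np.empty_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        est[i, j] = (M[i, j] + rng.normal(0, 0.5, n)).mean()

for i in range(est.shape[0]):
    for j in range(est.shape[1]):
        # Pure NE of a zero-sum game = saddle point: column max (row player
        # cannot improve) and row min (column player cannot improve).
        if est[i, j] >= est[:, j].max() and est[i, j] <= est[i].min():
            print("estimated pure Nash equilibrium at", (i, j))
```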