Multi-armed bandits refer to a task in which a fixed amount of resources must be allocated among competing choices so as to maximize expected gain. These problems typically involve an exploration/exploitation trade-off: the learner must balance trying arms whose payoffs are uncertain against repeatedly playing the arm that currently looks best.
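A minimal illustration of the exploration/exploitation trade-off is the epsilon-greedy policy: with small probability explore a random arm, otherwise exploit the arm with the best empirical mean. The sketch below (hypothetical setup, Bernoulli arms with made-up means) is illustrative, not any specific paper's method:

```python
import random

def epsilon_greedy(true_means, n_rounds=2000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli-reward arms.

    With probability epsilon, explore a uniformly random arm;
    otherwise, exploit the arm with the highest empirical mean so far.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k       # number of pulls per arm
    values = [0.0] * k     # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running mean
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward

counts, values, total = epsilon_greedy([0.2, 0.5, 0.8])
```

With a clear gap between arm means, the best arm ends up pulled far more often than the others, while the epsilon fraction of exploratory pulls keeps the estimates of the other arms from going stale.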
(Image credit: Microsoft Research)
Five categories of methods are described, making it easy to choose how to address sparse data with contextual bandits, with each method available for adaptation to the specific setting of concern.
The sample complexities of our algorithms depend, in particular, on the size of the optimal hitting set of the given intervals.
We propose to assess the fairness of personalized recommender systems in the sense of envy-freeness: every (group of) user(s) should prefer their recommendations to the recommendations of other (groups of) users.
However, there is a lack of general methods for conducting statistical inference with more complex models.
In this work, we formulate the SCB problem with a DNN reward function as a non-convex stochastic optimization problem, and design a stage-wise stochastic gradient descent algorithm to optimize it and determine the action policy.
We show that the regret is indeed never worse than the regret obtained by running LinUCB on the best representation (up to a $\ln M$ factor).
We propose upper confidence bound based algorithms for this MNL contextual bandit.
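That snippet concerns a contextual MNL bandit; as a simpler point of reference, the classic non-contextual UCB1 index plays the arm maximizing the empirical mean plus a confidence radius, sqrt(2 ln t / n_a). The sketch below is generic background (Bernoulli arms with made-up means), not the paper's algorithm:

```python
import math
import random

def ucb1(true_means, n_rounds=2000, seed=0):
    """Run the UCB1 index policy on Bernoulli-reward arms."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k       # pulls per arm
    values = [0.0] * k     # empirical mean reward per arm
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1    # initialize: pull each arm once
        else:
            # index = empirical mean + exploration bonus
            arm = max(
                range(k),
                key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

counts, values = ucb1([0.2, 0.5, 0.8])
```

The bonus term shrinks as an arm is pulled more, so under-explored arms keep being revisited until the data rules them out, which is the mechanism behind the logarithmic regret guarantees of UCB-style algorithms.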
We propose a novel algorithm for multi-player multi-armed bandits without collision sensing information.