Multi-armed bandits refer to a task in which a fixed amount of resources must be allocated among competing choices so as to maximize expected gain. Typically these problems involve an exploration/exploitation trade-off.
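To make the exploration/exploitation trade-off concrete, here is a minimal sketch of an epsilon-greedy policy on simulated Bernoulli arms. The function name `epsilon_greedy`, the arm probabilities, and the parameter defaults are illustrative assumptions, not taken from any of the papers listed here.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on simulated Bernoulli arms.

    true_means: hypothetical success probability of each arm (unknown to the policy).
    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of times each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward of each arm
    total_reward = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated mean.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Simulate a 0/1 reward from the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Incremental update of the running mean for that arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, total_reward


counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, total_reward)
```

A larger `epsilon` spends more pulls on exploration and fewer on the arm that currently looks best; a smaller value does the opposite and risks locking onto a suboptimal arm early.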
At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical.
SOTA for Multi-Armed Bandits on Mushroom
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action.
This work explores adaptations of successful multi-armed bandit policies to the online contextual bandit scenario with binary rewards, using binary classification algorithms such as logistic regression as black-box oracles.
In this survey we cover a few stochastic and adversarial contextual bandit algorithms.
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems.
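As an illustration of the heuristic, the following is a minimal sketch of Thompson Sampling for Bernoulli rewards with a uniform Beta(1, 1) prior on each arm. The function name `thompson_sampling`, the simulated arm probabilities, and the choice of prior are assumptions made for this example and are not drawn from the paper above.

```python
import random


def thompson_sampling(true_means, steps=5000, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated arms.

    true_means: hypothetical success probability of each arm (unknown to the policy).
    Returns (successes per arm, failures per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms

    for _ in range(steps):
        # Draw one sample per arm from its Beta posterior
        # (Beta(1, 1) prior, assumed here, plus observed counts).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulate a 0/1 reward and update that arm's posterior counts.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


successes, failures = thompson_sampling([0.2, 0.5, 0.8])
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

Exploration here comes from posterior uncertainty rather than from an explicit exploration rate: arms with few observations have wide posteriors and are occasionally sampled high, while well-explored poor arms are rarely chosen.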
We argue that less expressive discriminators are smoother and have a general coarse-grained view of the modes map, which forces the generator to cover a wide portion of the data distribution support.
Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection.