178 papers with code • 1 benchmark • 2 datasets
Multi-armed bandits refer to a task in which a fixed, limited set of resources must be allocated among competing alternatives (arms) so as to maximize expected gain, when each alternative's reward distribution is only partially known. These problems typically involve an exploration/exploitation trade-off: gathering information about uncertain arms versus playing the arm currently believed best.
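As a minimal illustration of the exploration/exploitation trade-off (a toy sketch, not taken from any paper listed here), an ε-greedy strategy explores a random arm with probability ε and otherwise exploits the arm with the highest estimated mean reward. The arm probabilities below are arbitrary, hypothetical values:

```python
import random

def epsilon_greedy(reward_fns, n_rounds=1000, epsilon=0.1, seed=0):
    """Pull one of len(reward_fns) arms per round; explore with probability epsilon."""
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward estimate per arm
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit
        r = reward_fns[arm](rng)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
        total += r
    return values, total

# Toy Bernoulli arms with (hypothetical) success probabilities 0.2, 0.5, 0.8.
arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in (0.2, 0.5, 0.8)]
values, total = epsilon_greedy(arms)
```

After enough rounds the estimated values concentrate around the true arm means, and the highest-mean arm dominates the pulls.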
(Image credit: Microsoft Research)
Libraries: Use these libraries to find Multi-Armed Bandits models and implementations
The DRR framework treats recommendation as a sequential decision-making procedure and adopts an "Actor-Critic" reinforcement learning scheme to model the interactions between users and recommender systems, which accounts for both dynamic adaptation and long-term rewards.
Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling
At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical.
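For intuition, Thompson sampling in its simplest exact form (a Beta-Bernoulli bandit, not the deep Bayesian variants the paper benchmarks) maintains a posterior per arm, draws one sample from each posterior, and pulls the arm with the largest sample. The arm probabilities below are hypothetical:

```python
import random

def thompson_bernoulli(probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling: keep a Beta(alpha, beta) posterior
    per arm, sample each posterior, and pull the arm with the largest sample."""
    rng = random.Random(seed)
    n = len(probs)
    alpha = [1.0] * n  # Beta(1, 1) uniform priors
    beta = [1.0] * n
    pulls = [0] * n
    for _ in range(n_rounds):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        reward = 1 if rng.random() < probs[arm] else 0  # simulated Bernoulli reward
        alpha[arm] += reward       # posterior update on success
        beta[arm] += 1 - reward    # posterior update on failure
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8])
```

With neural network reward models the posterior is intractable, which is why the approximate Bayesian methods compared in the paper are needed; the sampling-then-greedy structure stays the same.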
To alleviate this, we propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance.
We study the off-policy evaluation problem---estimating the value of a target policy using data collected by another policy---under the contextual bandit model.