Multi-Armed Bandits

178 papers with code • 1 benchmark • 2 datasets

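Multi-armed bandits refer to a task where a fixed amount of resources must be allocated among competing choices in a way that maximizes expected gain. Typically these problems involve an exploration/exploitation trade-off: the learner must balance trying arms whose payoff is still uncertain against repeatedly playing the arm that currently looks best.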
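
The trade-off can be illustrated with a minimal sketch, assuming Bernoulli (0/1) rewards and an epsilon-greedy rule. Epsilon-greedy is only one simple strategy, chosen here for brevity. The arm means, step count, and `epsilon` value are illustrative assumptions, not values taken from any paper listed on this page.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms (hypothetical setup).

    Returns (pull counts per arm, estimated mean per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a 0/1 reward from the (hidden) true mean of the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Incremental update of the running mean for that arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, [round(e, 2) for e in estimates], total_reward)
```

A larger `epsilon` spends more pulls on exploration; a smaller one commits to the current best estimate sooner and risks locking onto a suboptimal arm.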


Most implemented papers

Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling

sb-ai-lab/RePlay 29 Oct 2018

The DRR framework treats recommendation as a sequential decision making procedure and adopts an "Actor-Critic" reinforcement learning scheme to model the interactions between the users and recommender systems, which can consider both the dynamic adaptation and long-term rewards.

Neural Contextual Bandits with UCB-based Exploration

sauxpa/neural_exploration ICML 2020

To the best of our knowledge, it is the first neural network-based contextual bandit algorithm with a near-optimal regret guarantee.

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

tensorflow/models ICLR 2018

At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical.

On-line Adaptative Curriculum Learning for GANs

Byte7/Adaptive-Curriculum-GAN-keras 31 Jul 2018

We argue that less expressive discriminators are smoother and have a general coarse-grained view of the modes map, which forces the generator to cover a wide portion of the data distribution support.

Locally Differentially Private (Contextual) Bandits Learning

huang-research-group/LDPbandit2020 NeurIPS 2020

We study locally differentially private (LDP) bandits learning in this paper.

Neural Thompson Sampling

ZeroWeight/NeuralTS ICLR 2021

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems.
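
For orientation, here is a minimal sketch of classic Thompson Sampling in the simplest non-contextual setting, assuming Bernoulli rewards with a Beta(1, 1) prior on each arm. This is not the neural, contextual variant proposed in the paper; the arm means and step count are illustrative assumptions.

```python
import random


def thompson_sampling(true_means, steps=5000, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated arms (hypothetical setup).

    Returns (successes per arm, failures per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm

    for _ in range(steps):
        # Sample a plausible mean for each arm from its Beta posterior.
        # Beta(1 + successes, 1 + failures) follows from a uniform prior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled mean is highest.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Observe a 0/1 reward and update that arm's posterior counts.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


successes, failures = thompson_sampling([0.2, 0.5, 0.8])
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

Exploration here comes from posterior uncertainty rather than a fixed exploration rate: arms with few observations have wide posteriors and are occasionally sampled high enough to be played.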

Online Limited Memory Neural-Linear Bandits with Likelihood Matching

mlisicki/neuralkernelbandits 7 Feb 2021

To alleviate this, we propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.

Off-Policy Evaluation for Large Action Spaces via Embeddings

st-tech/zr-obp 13 Feb 2022

Unfortunately, when the number of actions is large, existing OPE estimators (most of which are based on inverse propensity score weighting) degrade severely and can suffer from extreme bias and variance.
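
As background, the following is a minimal sketch of the basic inverse propensity score (IPS) estimator that this abstract refers to, not the embedding-based estimator the paper proposes. The log format, the `target_policy` callable, and the toy data are all hypothetical and chosen only for illustration.

```python
def ips_estimate(logged, target_policy):
    """Basic IPS estimate of a target policy's value from logged bandit data.

    logged: list of (context, action, reward, logging_propensity) tuples,
            where logging_propensity is the probability with which the
            logging policy chose `action` in `context` (must be > 0).
    target_policy: callable (context, action) -> probability that the
            target policy would choose `action` in `context`.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        # Importance weight: how much more (or less) likely the target
        # policy is to take this action than the logging policy was.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logged)


# Toy log: a uniform logging policy over 2 actions (propensity 0.5 each).
logged = [
    ("u1", 0, 1.0, 0.5),
    ("u1", 1, 0.0, 0.5),
    ("u2", 0, 0.0, 0.5),
    ("u2", 1, 1.0, 0.5),
]


def target_policy(context, action):
    # Deterministic target: action 0 for "u1", action 1 for "u2".
    best = {"u1": 0, "u2": 1}
    return 1.0 if action == best[context] else 0.0


value = ips_estimate(logged, target_policy)
print(value)
```

With K actions and a uniform logging policy, each propensity is 1/K, so a matching action receives an importance weight as large as K. This growth in the weights is one reason the variance of IPS-style estimators increases with the size of the action space.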

Multi-Armed Bandits in Metric Spaces

facebookresearch/Horizon 29 Sep 2008

In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric.

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

facebookresearch/ReAgent ICML 2017

We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model.