Thompson Sampling
96 papers with code • 0 benchmarks • 0 datasets
Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
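As an illustration, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta posteriors; the arm success probabilities below are hypothetical and the code is only a sketch of the general idea, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit; the true success probabilities are unknown to the agent.
true_probs = np.array([0.30, 0.55, 0.60])
n_arms = len(true_probs)

# Beta(1, 1) priors: alpha tracks successes, beta tracks failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # Draw one belief (posterior sample) per arm, then act greedily on that draw.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    # Observe a Bernoulli reward and update the chosen arm's posterior.
    reward = float(rng.random() < true_probs[arm])
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))
```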
Benchmarks
These leaderboards are used to track progress in Thompson Sampling. No benchmark results are currently listed.
Most implemented papers
Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays
Recently, Thompson sampling (TS), a randomized algorithm with a Bayesian spirit, has attracted much attention for its empirically excellent performance, and it has been shown to achieve an optimal regret bound in the standard single-play MAB problem.
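For the multiple-play setting studied here, a hedged sketch under Beta-Bernoulli assumptions is: draw one posterior sample per arm and play the K arms with the largest samples. The arm means and K below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

true_probs = np.array([0.2, 0.4, 0.5, 0.7])  # hypothetical arm means
k = 2                                        # number of plays per round
alpha = np.ones(len(true_probs))
beta = np.ones(len(true_probs))

for t in range(1000):
    # One posterior sample per arm; play the k arms with the largest samples.
    theta = rng.beta(alpha, beta)
    chosen = np.argsort(theta)[-k:]
    rewards = (rng.random(k) < true_probs[chosen]).astype(float)
    alpha[chosen] += rewards
    beta[chosen] += 1.0 - rewards

print("posterior means:", alpha / (alpha + beta))
```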
Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
By parameterizing our learned model with a neural network, we are able to develop a scalable and efficient approach to exploration bonuses that can be applied to tasks with complex, high-dimensional state spaces.
Cascading Bandits for Large-Scale Recommendation Problems
In this work, we study cascading bandits, an online learning variant of the cascade model where the goal is to recommend $K$ most attractive items from a large set of $L$ candidate items.
Double Thompson Sampling for Dueling Bandits
This simple algorithm, Double Thompson Sampling (D-TS), applies to general Copeland dueling bandits, including Condorcet dueling bandits as a special case.
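A simplified sketch of the double-sampling idea, assuming Beta posteriors on pairwise win probabilities and omitting the confidence-interval pruning used in the full D-TS algorithm; the preference matrix below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4  # number of arms (hypothetical)

# Hypothetical preference matrix: P[i, j] = probability that arm i beats arm j.
P = np.array([[0.5, 0.6, 0.7, 0.8],
              [0.4, 0.5, 0.6, 0.7],
              [0.3, 0.4, 0.5, 0.6],
              [0.2, 0.3, 0.4, 0.5]])

# wins[i, j] counts how often arm i has beaten arm j (plus a Beta(1, 1) prior).
wins = np.ones((n, n))

for t in range(1000):
    # First sample: draw a full preference matrix and pick a sampled Copeland winner.
    theta1 = rng.beta(wins, wins.T)
    np.fill_diagonal(theta1, 0.5)
    first = int(np.argmax((theta1 > 0.5).sum(axis=1)))

    # Second, independent sample: pick the strongest challenger against the first arm.
    theta2 = rng.beta(wins[:, first], wins[first, :])
    theta2[first] = -np.inf
    second = int(np.argmax(theta2))

    # Duel the two arms and update the win counts.
    if rng.random() < P[first, second]:
        wins[first, second] += 1
    else:
        wins[second, first] += 1
```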
Stacked Thompson Bandits
We introduce Stacked Thompson Bandits (STB) for efficiently generating plans that are likely to satisfy a given bounded temporal logic requirement.
Mostly Exploration-Free Algorithms for Contextual Bandits
We prove that this algorithm is rate optimal without any additional assumptions on the context distribution or the number of arms.
AIXIjs: A Software Demo for General Reinforcement Learning
The universal Bayesian agent AIXI (Hutter, 2005) is a model of a maximally intelligent agent, and plays a central role in the sub-field of general reinforcement learning (GRL).
Asynchronous Parallel Bayesian Optimisation via Thompson Sampling
We design and analyse variations of the classical Thompson sampling (TS) procedure for Bayesian optimisation (BO) in settings where function evaluations are expensive, but can be performed in parallel.
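A minimal sketch of the Thompson sampling step for BO, assuming a scikit-learn Gaussian process surrogate and a discretised candidate grid (the objective and grid below are hypothetical); in the asynchronous parallel setting, each idle worker would draw its own independent posterior sample and evaluate that sample's maximiser.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)

def f(x):  # hypothetical expensive objective
    return (-np.sin(3.0 * x) - x ** 2 + 0.7 * x).ravel()

grid = np.linspace(-1.0, 2.0, 200).reshape(-1, 1)  # candidate points
X = rng.uniform(-1.0, 2.0, size=(3, 1))            # small initial design
y = f(X)

for t in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-6).fit(X, y)
    # Thompson sampling step: draw one posterior sample over the grid and query its maximiser.
    # With parallel workers, each worker would draw its own independent sample.
    sample = gp.sample_y(grid, n_samples=1, random_state=int(rng.integers(1 << 30))).ravel()
    x_next = grid[[np.argmax(sample)]]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print("best x found:", X[np.argmax(y), 0])
```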
Variational inference for the multi-armed contextual bandit
One general framework for optimizing interactions with the world while simultaneously learning how the world operates is the multi-armed bandit setting and, in particular, the contextual bandit case.
Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling
Reinforcement learning studies how to balance exploration and exploitation in real-world systems, optimizing interactions with the world while simultaneously learning how the world operates.