In this paper, we introduce a general formulation of how an arm's cumulative reward is distributed across several rounds, called the Beta-spread property.
We study a posterior sampling approach to efficient exploration in constrained reinforcement learning.
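As a minimal illustration of posterior sampling for exploration, the sketch below implements Thompson sampling on a Bernoulli bandit under a uniform Beta prior; the constrained-RL setting in the abstract is more general, and all names here are illustrative assumptions, not the paper's algorithm:

```python
import random

def thompson_step(successes, failures):
    """Sample each arm's Beta posterior and play the argmax.

    Under a uniform prior, arm i's posterior is Beta(1+s_i, 1+f_i),
    where s_i/f_i count observed successes/failures.
    """
    samples = [random.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run(true_means, horizon, seed=0):
    """Simulate Thompson sampling; returns per-arm (success, failure) counts."""
    random.seed(seed)
    k = len(true_means)
    succ, fail = [0] * k, [0] * k
    for _ in range(horizon):
        arm = thompson_step(succ, fail)
        if random.random() < true_means[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return succ, fail
```

Over a moderate horizon the posterior concentrates and the better arm is pulled far more often, which is the exploration/exploitation trade-off posterior sampling resolves automatically.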
In this article, we provide an extensive overview of fairness approaches that have been implemented via a reinforcement learning (RL) framework.
Our main theoretical results show that the impact of batch learning is a multiplicative factor of the batch size relative to the regret of online learning.
We consider a setting in which the objective is to learn to navigate in a controlled Markov process (CMP) where transition probabilities may abruptly change.
We consider reinforcement learning in changing Markov Decision Processes where both the state-transition probabilities and the reward functions may vary over time.
Counterfactual learning is a natural framework for improving web-based machine translation services by offline learning from feedback logged during user interactions.
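A standard building block of counterfactual learning from logged feedback is the inverse-propensity-scoring (IPS) estimator. The sketch below is a generic illustration under assumed names (the logging setup, reward values, and function names are not from the paper):

```python
import random

def ips_value(logged, target_prob):
    """IPS estimate of a target policy's value from logs of
    (action, reward, logging_propensity) triples: reweight each
    logged reward by target_prob(action) / logging_propensity."""
    return sum(r * target_prob(a) / p for a, r, p in logged) / len(logged)

def make_logs(n, seed=0):
    """Simulate a uniform logging policy over two actions with
    deterministic rewards 0.9 (action 0) and 0.1 (action 1)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        a = rng.randrange(2)  # logging propensity is 0.5 per action
        logs.append((a, 0.9 if a == 0 else 0.1, 0.5))
    return logs
```

For example, evaluating the deterministic policy "always play action 0" on these logs recovers its true value of 0.9 (up to sampling noise), without ever deploying that policy online.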
In this framework, motivated by privacy preservation in online recommender systems, the goal is to maximize the sum of the (unobserved) rewards, based on observations of these rewards transformed through a stochastic corruption process with known parameters.
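One concrete instance of such a known-parameter corruption process is randomized response on Bernoulli rewards; the sketch below shows how a learner can debias corrupted observations to estimate the true mean. This is an illustrative assumption, not necessarily the paper's corruption model:

```python
import random

def corrupt(x, p):
    """Randomized-response corruption with known parameter p:
    report the true Bernoulli reward x with probability p,
    and its flip 1-x otherwise."""
    return x if random.random() < p else 1 - x

def debias(y_mean, p):
    """Invert the known corruption: E[y] = (2p-1)*mu + (1-p),
    so mu = (E[y] - (1-p)) / (2p-1)."""
    return (y_mean - (1 - p)) / (2 * p - 1)

def estimate_mean(mu, p, n, seed=0):
    """Draw n corrupted observations of Bernoulli(mu) rewards and
    return the debiased estimate of mu."""
    random.seed(seed)
    ys = [corrupt(1 if random.random() < mu else 0, p) for _ in range(n)]
    return debias(sum(ys) / n, p)
```

Because the corruption parameters are known, the transformation is invertible in expectation, which is what makes learning from only corrupted observations possible at all.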
We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms.
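The relative-feedback model can be sketched as follows: each round the learner duels two arms and observes only the winner, drawn from a preference matrix. The simple sequential-elimination tournament below is an illustrative sketch, not the paper's algorithm, and all names are assumptions:

```python
import random

def duel(i, j, pref, rng):
    """Relative feedback: return the winner of a single comparison
    between arms i and j, where pref[i][j] = P(i beats j)."""
    return i if rng.random() < pref[i][j] else j

def tournament(pref, n_duels=501, seed=0):
    """Keep a running champion; a challenger replaces it by winning
    a majority of n_duels pairwise comparisons."""
    rng = random.Random(seed)
    champ = 0
    for challenger in range(1, len(pref)):
        wins = sum(duel(champ, challenger, pref, rng) == champ
                   for _ in range(n_duels))
        if wins * 2 < n_duels:
            champ = challenger
    return champ
```

With enough duels per pair, the tournament identifies a Condorcet winner (an arm preferred to every other arm) with high probability, using only pairwise outcomes and never an absolute reward.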