Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

4 Mar 2019  ·  Hossein Aboutalebi, Doina Precup, Tibor Schuster

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant potential applications in adaptive clinical trials, which allow the treatment allocation probabilities of patients to change dynamically. However, most bandit learning algorithms are designed to minimize the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure, which gives practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose for this type of problem is a modification of the BESA algorithm [Baransi et al., 2014] that considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems and on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.
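
To make the idea of comparing arms by a function of the return distribution more concrete, here is a minimal two-armed sketch in the spirit of BESA's subsampled comparison. It is an illustration under stated assumptions, not the paper's algorithm: the names `besa_choose`, `run`, and `mean_minus_spread` are hypothetical, and the mean-minus-standard-deviation score is just one possible choice of measure.

```python
import random
import statistics


def mean_minus_spread(samples, risk_weight=1.0):
    """Hypothetical safety-aware score: mean penalized by standard deviation.

    This particular measure is an assumption for illustration; the abstract
    only says that a function of the return distribution is optimized.
    """
    if len(samples) < 2:
        return statistics.fmean(samples)
    return statistics.fmean(samples) - risk_weight * statistics.pstdev(samples)


def besa_choose(history_a, history_b, score, rng):
    """Pick arm 0 or 1 using a BESA-style subsampled comparison."""
    # Pull any arm that has no observations yet.
    if not history_a:
        return 0
    if not history_b:
        return 1
    # Subsample both histories (without replacement) to the smaller size,
    # so the two scores are computed from equally many observations.
    n = min(len(history_a), len(history_b))
    score_a = score(rng.sample(history_a, n))
    score_b = score(rng.sample(history_b, n))
    if score_a > score_b:
        return 0
    if score_b > score_a:
        return 1
    # On a tie, prefer the arm that has been pulled less often.
    return 0 if len(history_a) <= len(history_b) else 1


def run(arms, horizon, score, seed=0):
    """Play two arms for `horizon` rounds; each arm is a callable taking an rng."""
    rng = random.Random(seed)
    histories = [[], []]
    for _ in range(horizon):
        arm = besa_choose(histories[0], histories[1], score, rng)
        histories[arm].append(arms[arm](rng))
    return histories


if __name__ == "__main__":
    # Two hypothetical treatments: similar means, very different variability.
    steady = lambda rng: rng.gauss(0.50, 0.05)
    erratic = lambda rng: rng.gauss(0.55, 0.60)
    pulls = run([steady, erratic], horizon=500, score=mean_minus_spread)
    print("pulls per arm:", [len(h) for h in pulls])
```

Passing `statistics.fmean` as `score` gives a plain comparison of subsampled means, while swapping in another function changes what the policy treats as the better arm without touching the selection loop.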
