no code implementations • 20 Apr 2023 • Wesley Cowan, Michael N. Katehakis, Sheldon M. Ross
We study new types of dynamic allocation problems, the "Halting Bandit" models.
no code implementations • 28 Sep 2019 • Wesley Cowan, Michael N. Katehakis, Daniel Pirutinsky
In this paper we derive an efficient method for computing the indices associated with an asymptotically optimal upper confidence bound algorithm (MDP-UCB) of Burnetas and Katehakis (1997) that only requires solving a system of two non-linear equations with two unknowns, irrespective of the cardinality of the state space of the Markovian decision process (MDP).
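Since the method above reduces the index computation to a system of two nonlinear equations in two unknowns, a generic two-dimensional Newton solver is the kind of routine it calls for. The sketch below is illustrative only: the actual index equations of Burnetas and Katehakis (1997) are not reproduced here, so a hypothetical stand-in system is solved instead.

```python
# Generic damped-free Newton iteration for a 2x2 nonlinear system F(x, y) = 0,
# with a finite-difference Jacobian. The MDP-UCB indices are defined by
# equations specific to Burnetas and Katehakis (1997); the example system at
# the bottom is a hypothetical stand-in, not the paper's equations.

def solve_2d(F, x0, y0, tol=1e-10, max_iter=100, h=1e-7):
    """Newton's method on F: R^2 -> R^2 using central differences."""
    x, y = x0, y0
    for _ in range(max_iter):
        f1, f2 = F(x, y)
        if abs(f1) < tol and abs(f2) < tol:
            return x, y
        # Numerical Jacobian J = [[a, b], [c, d]] via central differences.
        a = (F(x + h, y)[0] - F(x - h, y)[0]) / (2 * h)
        b = (F(x, y + h)[0] - F(x, y - h)[0]) / (2 * h)
        c = (F(x + h, y)[1] - F(x - h, y)[1]) / (2 * h)
        d = (F(x, y + h)[1] - F(x, y - h)[1]) / (2 * h)
        det = a * d - b * c
        if det == 0:
            break
        # Newton step: (dx, dy) = J^{-1} (f1, f2).
        dx = (d * f1 - b * f2) / det
        dy = (a * f2 - c * f1) / det
        x, y = x - dx, y - dy
    return x, y

# Hypothetical example system: x^2 + y^2 = 4 and x*y = 1.
root = solve_2d(lambda x, y: (x * x + y * y - 4.0, x * y - 1.0), 2.0, 0.5)
```

The point of the reduction in the paper is exactly this kind of cost profile: a fixed-size solve, independent of the cardinality of the MDP's state space.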
no code implementations • 13 Sep 2019 • Wesley Cowan, Michael N. Katehakis, Daniel Pirutinsky
In this paper we consider the basic version of Reinforcement Learning (RL) that involves computing optimal data-driven (adaptive) policies for Markovian decision processes (MDPs) with unknown transition probabilities.
no code implementations • 30 Nov 2018 • Apostolos N. Burnetas, Odysseas Kanavetas, Michael N. Katehakis
This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints.
no code implementations • 22 Oct 2015 • Michael N. Katehakis, Jian Yang, Tingting Zhou
Inventory control with unknown demand distribution is considered, with emphasis placed on the case involving discrete nonperishable items.
no code implementations • 7 Oct 2015 • Wesley Cowan, Michael N. Katehakis
We consider the classical problem of a controller activating (or sampling) sequentially from a finite number of $N \geq 2$ populations, specified by unknown distributions.
no code implementations • 9 Sep 2015 • Apostolos N. Burnetas, Odysseas Kanavetas, Michael N. Katehakis
We then construct a class of f-UF policies and provide conditions under which they are asymptotically optimal within the class of f-UF policies, i.e., they achieve the corresponding asymptotic lower bound.
no code implementations • 12 May 2015 • Wesley Cowan, Michael N. Katehakis
The purpose of this paper is to provide further understanding of the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies $\pi$.
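The sample ("pseudo") regret studied here is $\sum_i (\mu^* - \mu_i)\,T_i(n)$, where $T_i(n)$ is the number of activations of population $i$ up to horizon $n$. The sketch below computes this quantity for UCB1 (Auer et al.) on Bernoulli arms; UCB1 is used only as a familiar index policy for illustration, and the specific policy classes analysed in the paper may differ.

```python
import math
import random

def ucb_pseudo_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given (hypothetical) means and
    return the sample ("pseudo") regret sum_i (mu* - mu_i) * T_i(n)."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # T_i(n): activations of each arm
    sums = [0.0] * n_arms   # running sums of observed outcomes
    # Pull each arm once to initialise its sample mean.
    for i in range(n_arms):
        counts[i] = 1
        sums[i] = 1.0 if rng.random() < means[i] else 0.0
    for t in range(n_arms, horizon):
        # UCB1 index: sample mean plus exploration bonus.
        idx = [sums[i] / counts[i]
               + math.sqrt(2 * math.log(t + 1) / counts[i])
               for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: idx[i])
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
    best = max(means)
    return sum((best - means[i]) * counts[i] for i in range(n_arms))

regret = ucb_pseudo_regret([0.9, 0.5], horizon=2000)
```

Note that the pseudo-regret is a random variable (it depends on the realised $T_i(n)$), which is why probability-one finite-horizon bounds on it are a meaningful refinement of bounds on its expectation.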
no code implementations • 8 May 2015 • Wesley Cowan, Michael N. Katehakis
The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n=1, 2,\ldots$ so as to maximize the expected sum of outcomes of $n$ samples, or equivalently to minimize the regret due to lack of information about the parameters $\{ a_i \}$ and $\{ b_i \}$.
no code implementations • 22 Apr 2015 • Wesley Cowan, Junya Honda, Michael N. Katehakis
Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $ i = 1,\ldots , N,$ and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled.
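The sampling model above can be made concrete with a minimal simulation: $N$ populations, where the $k^{th}$ draw from population $i$ realises $X^i_k$. The distributions below (Bernoulli with hypothetical means) are stand-ins; the paper treats populations with general unknown distributions.

```python
import random

def sample_path(means, schedule, seed=0):
    """Given hypothetical Bernoulli populations and a fixed activation
    schedule (a list of population indices), return the realised outcomes
    X^i_k in order, plus the pull counts k reached by each population."""
    rng = random.Random(seed)
    pulls = [0] * len(means)    # k counter for each population i
    outcomes = []
    for i in schedule:
        pulls[i] += 1           # this draw is X^i_k for k = pulls[i]
        outcomes.append(1.0 if rng.random() < means[i] else 0.0)
    return outcomes, pulls

# Two populations; activate population 0 twice and population 1 three times.
outs, pulls = sample_path([0.3, 0.7], [0, 1, 1, 0, 1])
```

In the sequential problem the schedule is not fixed in advance: the policy chooses each index $i$ adaptively from the outcomes observed so far.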