We study conditional risk minimization (CRM), i.e., the problem of learning a hypothesis of minimal risk for predicting the next step of sequentially arriving dependent data.
In particular, we aim to learn predictors that minimize the conditional risk for a stochastic process, i.e., the expected loss of the predictor on the next point, conditioned on the set of training samples observed so far.
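As an illustrative sketch of the conditional-risk objective (not the method of the work above): for an order-1 Markov chain the conditional risk of a hypothesis depends on the history only through the last observed state, so it can be computed exactly from the transition probabilities. The two-state chain, the 0-1 loss, and the function names below are all assumptions made for this toy example.

```python
# Toy illustration of conditional risk: the expected loss of a
# prediction h for the next point, given the observed history.
# For an order-1 Markov chain, conditioning on the history reduces
# to conditioning on the last observed state.

def conditional_risk(h, last_state, transition, loss):
    # Exact conditional risk: sum over possible next states of
    # P(next | last_state) * loss(h, next).
    return sum(p * loss(h, nxt) for nxt, p in transition[last_state].items())

# Hypothetical two-state chain with 0-1 loss.
transition = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
zero_one = lambda h, y: 0.0 if h == y else 1.0

# The conditional-risk minimizer predicts the most likely next state
# given the last observation (here, last state 0).
best = min([0, 1], key=lambda h: conditional_risk(h, 0, transition, zero_one))
```

Here predicting 0 after seeing state 0 incurs conditional risk 0.2, versus 0.8 for predicting 1, so the minimizer picks 0.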
We consider the problem of minimizing regret in stochastic multi-armed bandits when the measure of goodness of an arm is not the mean return but some general function of the mean and the variance. We characterize the conditions under which learning is possible and present examples for which no natural algorithm can achieve sublinear regret.
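To make the setting concrete, here is a minimal sketch of one such mean-variance criterion, scoring an arm by mu - rho * sigma^2 and choosing the arm with the highest empirical score. The criterion, the risk-aversion parameter `rho`, and the sample data are assumptions for illustration, not taken from the work above.

```python
import statistics

def mean_variance_score(rewards, rho=1.0):
    # A common example of a general function of mean and variance:
    # the mean-variance criterion mu - rho * sigma^2, where rho > 0
    # penalizes risky (high-variance) arms.
    mu = statistics.fmean(rewards)
    var = statistics.pvariance(rewards)
    return mu - rho * var

# Hypothetical reward samples for two arms.
samples = {
    "arm_a": [1.0, 1.0, 1.0, 1.0],  # mean 1.0, variance 0.0
    "arm_b": [0.0, 2.2, 0.0, 2.2],  # mean 1.1, variance 1.21
}

# A naive plug-in rule picks the arm with the best empirical score;
# under this criterion the higher-mean but riskier arm_b loses.
best_arm = max(samples, key=lambda a: mean_variance_score(samples[a]))
```

Note that the plug-in rule is only a baseline: the abstract's point is precisely that for some such criteria no natural algorithm attains sublinear regret.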
We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes.