We study episodic reinforcement learning under unknown adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
This regret bound depends only on the maximum rank $M$ of the measurements rather than on the number of qubits, and thus exploits the low-rank structure of the measurements.
We introduce a multi-armed bandit algorithm with fairness constraints, where fairness is defined as a minimum rate at which a task or resource is assigned to a user.
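To make the minimum-rate notion of fairness concrete, the following is a minimal sketch and not the algorithm proposed here: it uses a standard UCB1 index, swapped in for illustration, and forces a pull whenever an arm has fallen behind its guaranteed share. The function name `fair_ucb`, the Bernoulli reward model, and the parameters `means` and `min_rates` are all assumptions made for this example.

```python
import math
import random


def fair_ucb(means, min_rates, horizon, seed=0):
    """Illustrative UCB1 variant with a per-arm minimum pull rate.

    means     -- hypothetical Bernoulli reward means, one per arm (simulation only)
    min_rates -- assumed fairness targets: arm i should receive roughly
                 min_rates[i] * t pulls after t rounds
    Returns the number of times each arm was pulled.
    """
    if sum(min_rates) > 1.0:
        raise ValueError("minimum rates must sum to at most 1")
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # pulls per arm
    totals = [0.0] * k  # cumulative reward per arm

    def ucb_index(i, t):
        # Standard UCB1 index; unexplored arms are tried first.
        if counts[i] == 0:
            return math.inf
        return totals[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, horizon + 1):
        # How far each arm is behind its guaranteed share of the first t rounds.
        deficits = [min_rates[i] * t - counts[i] for i in range(k)]
        worst = max(range(k), key=lambda i: deficits[i])
        if deficits[worst] > 0:
            arm = worst  # fairness step: serve the most under-served arm
        else:
            arm = max(range(k), key=lambda i: ucb_index(i, t))  # learning step
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts
```

For example, `fair_ucb([0.9, 0.1, 0.1], [0.0, 0.1, 0.1], horizon=2000)` still gives each low-reward arm roughly a tenth of the rounds. In this sketch, serving the largest deficit first keeps every arm within a constant number of pulls (at most the number of arms) of its target share, provided the rates sum to at most 1.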
We establish sub-linear regret bounds on the proposed notions of regret in both the online and bandit settings.
How should a robot that collaborates with multiple people decide how to distribute resources (e.g., social attention, or parts needed for an assembly)?
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.