no code implementations • 3 Nov 2021 • Gautam Chandrasekaran, Ambuj Tewari
In contrast, it has been shown that handling online MDPs with communicating structure and bandit information incurs $\Omega(T^{2/3})$ regret even in the case of deterministic transitions.