Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Kefan Dong, Yuanhao Wang, Xiaoyu Chen, Liwei Wang

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with a UCB exploration policy and proved that it achieves a nearly optimal regret bound for finite-horizon episodic MDPs...
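To illustrate the core idea behind Q-learning with UCB exploration, here is a minimal tabular sketch: actions are selected optimistically according to the Q-value plus a count-based bonus of the form $c\sqrt{\log t / N(s,a)}$. The specific bonus, learning rate, and environment below are illustrative assumptions for exposition and do not reproduce the paper's exact algorithm or analysis.

```python
import math
import random


def q_learning_ucb(env_step, n_states, n_actions, gamma=0.9,
                   bonus_coef=0.5, episodes=500, horizon=20, seed=0):
    """Tabular Q-learning with a UCB-style exploration bonus (sketch).

    Actions are chosen greedily with respect to Q plus an optimistic
    bonus c * sqrt(log(t) / N(s, a)); untried actions get an infinite
    bonus so every action is sampled at least once. The learning rate
    alpha = 1 / N(s, a) is a simple illustrative choice, not the
    carefully tuned rate used in the paper's analysis.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    N = [[0] * n_actions for _ in range(n_states)]
    t = 1
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            def score(a):
                if N[s][a] == 0:
                    return float('inf')
                return Q[s][a] + bonus_coef * math.sqrt(math.log(t) / N[s][a])
            a = max(range(n_actions), key=score)
            s_next, r = env_step(s, a, rng)
            N[s][a] += 1
            alpha = 1.0 / N[s][a]
            # standard Q-learning target with discounted bootstrap
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            t += 1
    return Q


def chain_step(s, a, rng):
    """Toy deterministic two-state chain (an assumed example environment):
    in state 0, action 1 moves to state 1 with reward 0; in state 1,
    action 0 yields reward 1 and stays; any other action resets to state 0."""
    if s == 0 and a == 1:
        return 1, 0.0
    if s == 1 and a == 0:
        return 1, 1.0
    return 0, 0.0
```

Running `q_learning_ucb(chain_step, n_states=2, n_actions=2)` on this toy chain, the learned greedy policy prefers moving toward and staying in the rewarding state.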

Published at ICLR 2020.