In an episodic Markov Decision Process (MDP) problem, an online algorithm chooses from a set of actions in a sequence of $H$ trials, where $H$ is the episode length, in order to maximize the total payoff of the chosen actions. Q-learning, as the most popular model-free reinforcement learning (RL) algorithm, directly parameterizes and updates value functions without explicitly modeling the environment...
