2 code implementations • 5 Feb 2021 • Ang A. Li, Zongqing Lu, Chenglin Miao
Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving lower and upper bounds on these value metrics for soft Q-learning; the bounds turn out to be the product of $|\text{TD}|$ and the "on-policyness" of the experiences.
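
To make the "product of $|\text{TD}|$ and on-policyness" idea concrete, here is a minimal sketch of how such a quantity could be computed for a single transition under soft Q-learning. It is an illustration under stated assumptions, not the paper's derivation: the function names (`soft_value`, `soft_policy_prob`, `priority`) are hypothetical, "on-policyness" is assumed to be measured by the Boltzmann policy probability $\pi(a|s)$ of the stored action, and the exact constants in the paper's lower and upper bounds are not reproduced.

```python
import math


def soft_value(q_values, alpha):
    """Soft state value V(s) = alpha * log(sum_a exp(Q(s, a) / alpha)).

    Subtracting the max first keeps the log-sum-exp numerically stable.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy_prob(q_values, action, alpha):
    """Boltzmann policy pi(a|s) = exp((Q(s, a) - V(s)) / alpha)."""
    return math.exp((q_values[action] - soft_value(q_values, alpha)) / alpha)


def priority(q_s, action, reward, q_next, gamma=0.99, alpha=1.0, done=False):
    """Hypothetical replay priority: |soft TD error| * pi(a|s).

    ASSUMPTION: pi(a|s) stands in for the "on-policyness" of the experience;
    the paper's actual bounds may include additional factors.
    """
    # Soft Bellman target: r + gamma * V(s'), with no bootstrap at terminal states.
    target = reward + (0.0 if done else gamma * soft_value(q_next, alpha))
    td_error = target - q_s[action]
    return abs(td_error) * soft_policy_prob(q_s, action, alpha)


if __name__ == "__main__":
    # Two actions with equal Q-values, so the stored action has probability 0.5.
    print(priority([0.0, 0.0], 0, reward=1.0, q_next=[0.0, 0.0], done=True))
```

Under this sketch, a transition with a large TD error still receives a low priority if the current soft policy would rarely take the stored action, which is the qualitative behavior the sentence above describes.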