ICML 2020 • Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone
In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted by the number of times that action occurred in the batch, not by the true probability of the action under the given policy.
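
The weighting effect described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the one-state toy batch, the names `BATCH`, `POLICY`, `batch_td0`, and `policy_over_empirical`, and the choice of reweighting each update by the ratio of the policy probability to the empirical action frequency. It is not necessarily the authors' algorithm, only one simple way to show how batch counts and policy probabilities can disagree.

```python
from collections import Counter

# Hypothetical toy batch: one state "s", two actions, each episode ends after one step.
# A transition is (state, action, reward, next_state); next_state None means terminal.
BATCH = [("s", "a0", 1.0, None)] * 3 + [("s", "a1", 0.0, None)]

# Assumed evaluation policy: both actions equally likely, so the true value of "s" is 0.5.
POLICY = {("s", "a0"): 0.5, ("s", "a1"): 0.5}


def batch_td0(batch, weights=None, gamma=1.0, alpha=0.5, sweeps=200):
    """Batch TD(0): accumulate TD errors over the whole batch, then update once per sweep."""
    values = {s: 0.0 for s, _, _, _ in batch}
    counts = Counter(s for s, _, _, _ in batch)
    for _ in range(sweeps):
        totals = {s: 0.0 for s in values}
        for s, a, r, s_next in batch:
            v_next = 0.0 if s_next is None else values.get(s_next, 0.0)
            w = 1.0 if weights is None else weights[(s, a)]
            totals[s] += w * (r + gamma * v_next - values[s])
        for s in values:
            values[s] += alpha * totals[s] / counts[s]
    return values


def policy_over_empirical(batch, policy):
    """Per-(state, action) weight: policy probability divided by empirical frequency."""
    sa_counts = Counter((s, a) for s, a, _, _ in batch)
    s_counts = Counter(s for s, _, _, _ in batch)
    return {
        (s, a): policy[(s, a)] / (n / s_counts[s])
        for (s, a), n in sa_counts.items()
    }


if __name__ == "__main__":
    plain = batch_td0(BATCH)
    corrected = batch_td0(BATCH, policy_over_empirical(BATCH, POLICY))
    print(round(plain["s"], 3), round(corrected["s"], 3))
```

In this toy batch, action `a0` appears three times out of four even though the assumed policy picks it half the time. Unweighted batch TD(0) therefore settles near 0.75, the batch average of the rewards, while the reweighted variant settles near the policy's true value of 0.5.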