Sarsa($\lambda$) extends eligibility traces to action-value methods. It uses the same weight update as TD($\lambda$), $\mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta_{t}\mathbf{z}_{t}$, but with the action-value form of the TD error:
$$ \delta_{t} = R_{t+1} + \gamma\hat{q}\left(S_{t+1}, A_{t+1}, \mathbf{w}_{t}\right) - \hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right) $$
and the action-value form of the eligibility trace:
$$ \mathbf{z}_{-1} = \mathbf{0} $$
$$ \mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t} \right), \quad 0 \leq t \leq T $$
Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
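The two equations above can be sketched in code. Below is a minimal illustration of Sarsa($\lambda$) with a linear action-value approximation $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s,a)$, so that $\nabla\hat{q} = \mathbf{x}(s,a)$. The environment (`step`), features, and all hyperparameter values are hypothetical choices for demonstration, not part of the source text:

```python
import numpy as np

# Sketch of Sarsa(lambda) with accumulating traces and a linear
# approximator over one-hot (state, action) features. The toy MDP
# below is invented for illustration: action 0 keeps the state,
# action 1 flips it; taking action 1 in state 1 yields reward +1
# and ends the episode.
n_states, n_actions = 2, 2
n_features = n_states * n_actions
gamma, lam, alpha, epsilon = 0.9, 0.8, 0.1, 0.1
rng = np.random.default_rng(0)

def x(s, a):
    """One-hot feature vector for the (state, action) pair."""
    phi = np.zeros(n_features)
    phi[s * n_actions + a] = 1.0
    return phi

def q_hat(s, a, w):
    return w @ x(s, a)

def epsilon_greedy(s, w):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([q_hat(s, a, w) for a in range(n_actions)]))

def step(s, a):
    """Hypothetical dynamics; returns (next_state, reward, done)."""
    if s == 1 and a == 1:
        return 0, 1.0, True
    return (s if a == 0 else 1 - s), 0.0, False

w = np.zeros(n_features)
for episode in range(200):
    s = 0
    a = epsilon_greedy(s, w)
    z = np.zeros(n_features)           # z_{-1} = 0
    done = False
    while not done:
        s_next, r, done = step(s, a)
        # TD error, action-value form; the bootstrap term is dropped
        # at terminal states where q_hat is defined as 0.
        delta = r - q_hat(s, a, w)
        if not done:
            a_next = epsilon_greedy(s_next, w)
            delta += gamma * q_hat(s_next, a_next, w)
        # z_t = gamma * lambda * z_{t-1} + grad q_hat, and grad q_hat
        # is just x(s, a) for a linear approximator.
        z = gamma * lam * z + x(s, a)
        w = w + alpha * delta * z      # semi-gradient update along the trace
        if not done:
            s, a = s_next, a_next
```

With one-hot features this reduces to tabular Sarsa($\lambda$); the linear form is kept to mirror the gradient notation in the equations.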