Sarsa is an on-policy TD control algorithm:
$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right) + \alpha\left[R_{t+1} + \gamma Q\left(S_{t+1}, A_{t+1}\right) - Q\left(S_{t}, A_{t}\right)\right] $$
This update is done after every transition from a nonterminal state $S_{t}$. If $S_{t+1}$ is terminal, then $Q\left(S_{t+1}, A_{t+1}\right)$ is defined as zero.
To obtain an on-policy control algorithm from Sarsa, we continually estimate $q_{\pi}$ for the behaviour policy $\pi$ and at the same time change $\pi$ towards greediness with respect to $q_{\pi}$.
Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
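The update above can be wrapped in an $\epsilon$-greedy control loop. Below is a minimal tabular sketch, assuming a Gym-style environment with discrete states and actions (`env.reset()`, `env.step()`); the environment interface, episode count, and hyperparameters are illustrative assumptions, not taken from the source.

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Sarsa with an epsilon-greedy behaviour policy (sketch)."""
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        # Behaviour policy: greedy w.r.t. Q with probability 1 - epsilon.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()          # assumed Gym-style API
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            # Q(S_{t+1}, A_{t+1}) is defined as zero when S_{t+1} is terminal.
            target = r if done else r + gamma * Q[s_next, a_next]
            # Sarsa update: move Q(S_t, A_t) toward the TD target.
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```

Because the action $A_{t+1}$ used in the target is the one actually taken by the behaviour policy, the algorithm is on-policy; replacing `Q[s_next, a_next]` with `max(Q[s_next])` would instead give Q-learning.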
Task | Papers | Share |
---|---|---|
Reinforcement Learning (RL) | 37 | 37.37% |
Reinforcement Learning | 28 | 28.28% |
Decision Making | 4 | 4.04% |
Deep Reinforcement Learning | 4 | 4.04% |
OpenAI Gym | 3 | 3.03% |
Continuous Control | 3 | 3.03% |
Combinatorial Optimization | 2 | 2.02% |
Management | 2 | 2.02% |
Imitation Learning | 1 | 1.01% |