Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any target and behaviour policy pair $\left(\pi, \beta\right)$. With off-policy rollouts for TD learning, the update must be corrected by importance sampling (here $\delta_{t}$ denotes the TD error at time $t$):
$$ \Delta{Q}^{\text{imp}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\delta_{t} $$
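A minimal numerical sketch of this update, not from the paper: `pi_probs`, `beta_probs`, and `td_errors` are hypothetical arrays holding $\pi\left(A_{\tau}\mid{S_{\tau}}\right)$, $\beta\left(A_{\tau}\mid{S_{\tau}}\right)$, and $\delta_{\tau}$ along a single sampled trajectory.

```python
import numpy as np

def importance_sampled_deltas(pi_probs, beta_probs, td_errors, gamma=0.99):
    """Sketch: gamma^t * prod_{tau<=t}(pi/beta) * delta_t for each step t."""
    ratios = np.asarray(pi_probs) / np.asarray(beta_probs)  # per-step importance weights
    cum_weights = np.cumprod(ratios)                        # product over tau = 1..t
    discounts = gamma ** np.arange(len(td_errors))          # gamma^t
    return discounts * cum_weights * td_errors
```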
The cumulative product of importance ratios can have very high variance: individual ratios can be much larger than 1, so the product can explode with trajectory length. Retrace therefore modifies $\Delta{Q}$ by truncating each importance weight at a constant $c$:
$$ \Delta{Q}^{\text{ret}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\min\left(c, \frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\right)\delta_{t} $$
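Under the same hypothetical setup as the sketch above, the only change is clipping each ratio at $c$ before taking the cumulative product:

```python
import numpy as np

def retrace_deltas(pi_probs, beta_probs, td_errors, gamma=0.99, c=1.0):
    """Sketch of the Retrace-style update: each importance ratio is truncated at c."""
    clipped = np.minimum(c, np.asarray(pi_probs) / np.asarray(beta_probs))
    cum_weights = np.cumprod(clipped)               # bounded by c^t
    discounts = gamma ** np.arange(len(td_errors))
    return discounts * cum_weights * td_errors
```

Because every factor is at most $c$, the cumulative weight is bounded by $c^{t}$, so the variance of the update stays controlled regardless of trajectory length, while ratios already below $c$ are left untouched.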
Source: Safe and Efficient Off-Policy Reinforcement Learning
| Task | Papers | Share |
|---|---|---|
| Problem Decomposition | 2 | 13.33% |
| General Classification | 2 | 13.33% |
| Atari Games | 2 | 13.33% |
| Automatic Speech Recognition | 1 | 6.67% |
| Speech Recognition | 1 | 6.67% |
| Face Anti-Spoofing | 1 | 6.67% |
| Face Recognition | 1 | 6.67% |
| Time Series | 1 | 6.67% |
| Time Series Classification | 1 | 6.67% |