V-trace is an off-policy actor-critic reinforcement learning algorithm that tackles the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\left(x_{t}, a_{t}, r_{t}\right)^{t=s+n}_{t=s}$ generated by an actor following some behaviour policy $\mu$. We define the $n$-step V-trace target for $V\left(x_{s}\right)$, our value approximation at state $x_{s}$, as:
$$ v_{s} = V\left(x_{s}\right) + \sum^{s+n-1}_{t=s}\gamma^{t-s}\left(\prod^{t-1}_{i=s}c_{i}\right)\delta_{t}V $$
where $\delta_{t}V = \rho_{t}\left(r_{t} + \gamma{V}\left(x_{t+1}\right) - V\left(x_{t}\right)\right)$ is a temporal difference for $V$, and $\rho_{t} = \min\left(\bar{\rho}, \frac{\pi\left(a_{t}\mid{x_{t}}\right)}{\mu\left(a_{t}\mid{x_{t}}\right)}\right)$ and $c_{i} = \min\left(\bar{c}, \frac{\pi\left(a_{i}\mid{x_{i}}\right)}{\mu\left(a_{i}\mid{x_{i}}\right)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
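Because the correction term satisfies the recursion $v_{s} = V\left(x_{s}\right) + \delta_{s}V + \gamma c_{s}\left(v_{s+1} - V\left(x_{s+1}\right)\right)$, the targets can be computed with a single backward pass over the trajectory. Below is a minimal NumPy sketch of that recursion for one length-$n$ trajectory; the function name, argument names, and shapes are illustrative assumptions, not the IMPALA reference implementation.

```python
import numpy as np

def vtrace_targets(log_pi, log_mu, rewards, values, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch: compute v_s for every state of a length-n trajectory.

    log_pi, log_mu:  log-probabilities of the taken actions under the target
                     policy pi and behaviour policy mu, shape [n]
    rewards:         r_t, shape [n]
    values:          V(x_t), shape [n]
    bootstrap_value: V(x_{s+n}), used in the final temporal difference
    """
    ratios = np.exp(log_pi - log_mu)          # pi(a_t|x_t) / mu(a_t|x_t)
    rhos = np.minimum(rho_bar, ratios)        # truncated rho_t
    cs = np.minimum(c_bar, ratios)            # truncated c_t
    values_next = np.append(values, bootstrap_value)[1:]   # V(x_{t+1})
    deltas = rhos * (rewards + gamma * values_next - values)  # delta_t V

    # Backward recursion: correction_s = delta_s V + gamma * c_s * correction_{s+1}
    acc = 0.0
    corrections = np.zeros(len(values))
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections               # v_s = V(x_s) + correction_s
```

Note that when the importance ratios are never truncated (e.g. on-policy, where $\pi = \mu$ and $\bar{\rho}, \bar{c} \geq 1$), the target reduces to the standard $n$-step Bellman target.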
Source: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
| Task | Papers | Share |
|---|---|---|
| Reinforcement Learning | 18 | 24.66% |
| Reinforcement Learning (RL) | 15 | 20.55% |
| Starcraft | 7 | 9.59% |
| Starcraft II | 7 | 9.59% |
| Deep Reinforcement Learning | 5 | 6.85% |
| Decision Making | 3 | 4.11% |
| Atari Games | 3 | 4.11% |
| Continuous Control | 2 | 2.74% |
| OpenAI Gym | 2 | 2.74% |