Policy Gradient Methods

IMPALA, or the Importance Weighted Actor Learner Architecture, is an off-policy actor-critic framework that decouples acting from learning and learns from experience trajectories using V-trace. Unlike the popular A3C-based agents, in which workers communicate gradients with respect to the parameters of the policy to a central parameter server, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner. Since the learner in IMPALA has access to full trajectories of experience we use a GPU to perform updates on mini-batches of trajectories while aggressively parallelising all time independent operations.

This type of decoupled architecture can achieve very high throughput. However, because the policy used to generate a trajectory can lag behind the policy on the learner by several updates at the time of gradient calculation, learning becomes off-policy. The V-trace off-policy actor-critic algorithm is used to correct for this harmful discrepancy.

Source: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures


Paper Code Results Date Stars


Task Papers Share
Reinforcement Learning (RL) 8 50.00%
Continuous Control 2 12.50%
Atari Games 2 12.50%
OpenAI Gym 2 12.50%
Image Captioning 1 6.25%
Edge-computing 1 6.25%