Decentralized Distributed Proximal Policy Optimization (DD-PPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.
Let $r_{t}\left(\theta\right)$ denote the probability ratio $r_{t}\left(\theta\right) = \frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}$, so $r\left(\theta_{old}\right) = 1$. TRPO maximizes a “surrogate” objective:
$$ L^{CPI}\left(\theta\right) = \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}\hat{A}_{t}\right] = \hat{\mathbb{E}}_{t}\left[r_{t}\left(\theta\right)\hat{A}_{t}\right] $$
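As an illustration, the ratio $r_{t}\left(\theta\right)$ and this surrogate objective can be computed from log-probabilities stored at rollout time. The sketch below assumes PyTorch; the function and tensor names are illustrative, not taken from any reference implementation.

```python
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Surrogate objective: the empirical mean of r_t(theta) * A_t.

    new_log_probs: log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t), stored when the data was collected
    advantages:    advantage estimates A_t (e.g. from GAE)
    """
    # r_t(theta) = pi_theta / pi_theta_old, computed in log-space for stability
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Empirical expectation over the batch of timesteps
    return (ratio * advantages).mean()
```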
As a general abstraction, DD-PPO implements the following: at step $k$, worker $n$ has a copy of the parameters, $\theta^k_n$, calculates the gradient, $\delta \theta^k_n$, and updates $\theta$ via
$$ \theta^{k+1}_n = \text{ParamUpdate}\Big(\theta^{k}_n, \text{AllReduce}\big(\delta \theta^k_1, \ldots, \delta \theta^k_N\big)\Big) = \text{ParamUpdate}\Big(\theta^{k}_n, \frac{1}{N} \sum_{i=1}^{N} { \delta \theta^k_i} \Big) $$
where $\text{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\text{AllReduce}$ performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).
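A minimal sketch of one such synchronous update, assuming PyTorch and its `torch.distributed` package (the function name, arguments, and training-loop details are assumptions, not the DD-PPO reference code):

```python
import torch
import torch.distributed as dist

def distributed_update(model, optimizer, loss, world_size):
    """One synchronous, decentralized update step.

    Every worker runs this identical code; there is no parameter server.
    """
    optimizer.zero_grad()
    loss.backward()  # delta theta_n^k: this worker's local gradient
    for p in model.parameters():
        if p.grad is not None:
            # AllReduce: sum the gradient across all N workers, then average it
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()  # ParamUpdate with the averaged gradient
```

In practice, wrapping the model in `torch.nn.parallel.DistributedDataParallel` performs this AllReduce automatically during the backward pass, overlapping communication with gradient computation.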
Source: DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames
Task | Papers | Share |
---|---|---|
PointGoal Navigation | 4 | 19.05% |
Navigate | 4 | 19.05% |
Reinforcement Learning | 3 | 14.29% |
Robot Navigation | 2 | 9.52% |
Deep Reinforcement Learning | 1 | 4.76% |
Problem Decomposition | 1 | 4.76% |
Reinforcement Learning (RL) | 1 | 4.76% |
Semantic Segmentation | 1 | 4.76% |
ObjectGoal Navigation | 1 | 4.76% |