Decentralized Distributed Proximal Policy Optimization (DDPPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DDPPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.
Let $r_{t}\left(\theta\right)$ denote the probability ratio $r_{t}\left(\theta\right) = \frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}$, so $r_{t}\left(\theta_{old}\right) = 1$. TRPO maximizes a "surrogate" objective:
$$ L^{CPI}\left(\theta\right) = \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}\hat{A}_{t}\right] = \hat{\mathbb{E}}_{t}\left[r_{t}\left(\theta\right)\hat{A}_{t}\right] $$
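As a concrete illustration, here is a minimal PyTorch sketch of this surrogate objective. The function and argument names are illustrative, and it assumes the log-probabilities and advantage estimates have already been computed:

```python
import torch

def surrogate_objective(log_probs, old_log_probs, advantages):
    """Sample estimate of the surrogate: mean of r_t(theta) * A_hat_t.

    log_probs:     log pi_theta(a_t | s_t) for the sampled actions
    old_log_probs: log pi_theta_old(a_t | s_t), held fixed
    advantages:    advantage estimates A_hat_t
    """
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability
    ratio = torch.exp(log_probs - old_log_probs.detach())
    return (ratio * advantages).mean()
```

Computing the ratio as the exponential of a log-probability difference, rather than dividing raw probabilities, avoids numerical underflow for small probabilities.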
As a general abstraction, DDPPO implements the following: at step $k$, worker $n$ has a copy of the parameters, $\theta^k_n$, calculates the gradient, $\delta \theta^k_n$, and updates $\theta$ via
$$ \theta^{k+1}_n = \text{ParamUpdate}\Big(\theta^{k}_n, \text{AllReduce}\big(\delta \theta^k_1, \ldots, \delta \theta^k_N\big)\Big) = \text{ParamUpdate}\Big(\theta^{k}_n, \frac{1}{N} \sum_{i=1}^{N} { \delta \theta^k_i} \Big) $$
where $\text{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\text{AllReduce}$ performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed data-parallel training scales very well (near-linear scaling up to 32,000 GPUs) and is reasonably simple to implement (all workers synchronously run identical code).
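A minimal sketch of this synchronous gradient averaging using `torch.distributed` (assuming the process group has already been initialized with `dist.init_process_group` and each worker has computed local gradients via backpropagation; the function name is illustrative):

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model, world_size):
    """The AllReduce step above: average gradients across all N workers."""
    for p in model.parameters():
        if p.grad is not None:
            # Sum the gradient copies across workers, then divide by N
            # so every worker holds the same mean gradient.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# After loss.backward() on each worker:
#   allreduce_gradients(model, dist.get_world_size())
#   optimizer.step()   # ParamUpdate: any first-order optimizer
```

Because every worker applies the same averaged gradient to the same parameters, the parameter copies $\theta^k_n$ stay identical across workers without any central parameter server.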
Source: DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

| Task | Papers | Share |
| --- | --- | --- |
| PointGoal Navigation | 4 | 25.00% |
| Navigate | 4 | 25.00% |
| Robot Navigation | 2 | 12.50% |
| Problem Decomposition | 1 | 6.25% |
| Reinforcement Learning (RL) | 1 | 6.25% |
| Semantic Segmentation | 1 | 6.25% |
| Visual Odometry | 1 | 6.25% |
| Autonomous Navigation | 1 | 6.25% |
| Scene Understanding | 1 | 6.25% |