Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much with a KL divergence constraint on the size of the policy update at each iteration.
Take the case of offpolicy reinforcement learning, where the policy $\beta$ for collecting trajectories on rollout workers is different from the policy $\pi$ to optimize for. The objective function in an offpolicy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:
$$ J\left(\theta\right) = \sum_{s\in{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\pi_{\theta}\left(a\mid{s}\right)\hat{A}_{\theta_{old}}\left(s, a\right)\right) $$
$$ J\left(\theta\right) = \sum_{s\in{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\beta\left(a\mid{s}\right)\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right) $$
$$ J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\beta}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$
When training on policy, theoretically the policy for collecting data is same as the policy that we want to optimize. However, when rollout workers and optimizers are running in parallel asynchronously, the behavior policy can get stale. TRPO considers this subtle difference: It labels the behavior policy as $\pi_{\theta_{old}}\left(a\mid{s}\right)$ and thus the objective function becomes:
$$ J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\pi_{\theta_{old}}}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\pi_{\theta_{old}}\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$
TRPO aims to maximize the objective function $J\left(\theta\right)$ subject to a trust region constraint which enforces the distance between old and new policies measured by KLdivergence to be small enough, within a parameter $\delta$:
$$ \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}} \left[D_{KL}\left(\pi_{\theta_{old}}\left(.\mid{s}\right)\mid\mid\pi_{\theta}\left(.\mid{s}\right)\right)\right] \leq \delta$$
Source: Trust Region Policy OptimizationPaper  Code  Results  Date  Stars 

Task  Papers  Share 

Reinforcement Learning (RL)  43  43.43% 
Continuous Control  10  10.10% 
Decision Making  4  4.04% 
Face AntiSpoofing  3  3.03% 
Face Recognition  3  3.03% 
MultiTask Learning  3  3.03% 
Atari Games  3  3.03% 
Benchmarking  2  2.02% 
Problem Decomposition  2  2.02% 
Component  Type 


🤖 No Components Found  You can add them if they exist; e.g. Mask RCNN uses RoIAlign 