TayPO, or Taylor Expansion Policy Optimization, refers to a family of algorithms that apply $k$th-order Taylor expansions to policy optimization, generalizing prior work and including TRPO as a special case. It can be thought of as unifying ideas from trust-region policy search and off-policy corrections, with both of which Taylor expansions share high-level similarities. To build intuition for these similarities, consider a simple 1D example. Given a sufficiently smooth real-valued function $f : \mathbb{R} \rightarrow \mathbb{R}$, the $k$th-order Taylor expansion of $f\left(x\right)$ at $x_{0}$ is
$$f_{k}\left(x\right) = f\left(x_{0}\right) + \sum^{k}_{i=1}\left[f^{(i)}\left(x_{0}\right)/i!\right]\left(x - x_{0}\right)^{i}$$
where $f^{(i)}\left(x_{0}\right)$ is the $i$th-order derivative of $f$ at $x_{0}$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region: for the expansion to converge, the constraint $\left|x - x_{0}\right| < R\left(f, x_{0}\right)$ is required, where $R\left(f, x_{0}\right)$ is the radius of convergence. Second, when the truncation is used as an approximation to the original function, $f_{k}\left(x\right) \approx f\left(x\right)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating a target policy with behavior data. Indeed, to evaluate the truncation $f_{k}\left(x\right)$ at any $x$ (the target policy), we only require the behavior policy "data" at $x_{0}$ (i.e., the derivatives $f^{(i)}\left(x_{0}\right)$).
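As a concrete illustration of both intuitions, here is a minimal numerical sketch in Python (an illustration of the 1D analogy only, not the TayPO algorithm itself). It expands $f\left(x\right) = 1/\left(1 - x\right)$ at $x_{0} = 0$, where the radius of convergence is $R\left(f, 0\right) = 1$, and evaluates the truncation at several target points using only the derivatives collected at $x_{0}$:

```python
import math

# Illustrative 1D example (not the TayPO algorithm): f(x) = 1/(1 - x)
# has i-th derivative f^(i)(0) = i!, so its k-th order truncation at 0
# is f_k(x) = sum_{i=0}^{k} x^i, with radius of convergence R(f, 0) = 1.

def f(x):
    return 1.0 / (1.0 - x)

def taylor_truncation(x, x0, derivs):
    """Evaluate f_k(x) = sum_i [f^(i)(x0) / i!] * (x - x0)^i.

    `derivs[i]` holds the i-th derivative of f at x0 (derivs[0] = f(x0)).
    It plays the role of the behavior "data" collected at x0: once we
    have it, we can evaluate the truncation at any target point x.
    """
    return sum(d / math.factorial(i) * (x - x0) ** i
               for i, d in enumerate(derivs))

k = 8
derivs = [math.factorial(i) for i in range(k + 1)]  # f^(i)(0) = i!

# Inside the trust region |x - 0| < 1 the truncation tracks f;
# at x = 1.2, outside the radius, the approximation breaks down.
for x in (0.3, 0.6, 0.9, 1.2):
    approx = taylor_truncation(x, 0.0, derivs)
    print(f"x={x:4.1f}  f(x)={f(x):8.3f}  f_k(x)={approx:10.3f}")
```

Running this shows the truncation is accurate well inside the radius (e.g., $x = 0.3$), converges slowly near the boundary ($x = 0.9$), and diverges from $f$ outside it ($x = 1.2$), mirroring why a trust-region constraint on how far the target may move from the expansion point is needed.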
Source: Taylor Expansion Policy Optimization
