TayPO, or Taylor Expansion Policy Optimization, refers to a family of algorithms that apply $k$th-order Taylor expansions to policy optimization, generalizing prior work and including TRPO as a special case. It can be thought of as unifying ideas from trust-region policy search and off-policy corrections, with both of which Taylor expansions share high-level similarities. To build intuition for these similarities, consider a simple 1D example. Given a sufficiently smooth real-valued function $f : \mathbb{R} \rightarrow \mathbb{R}$, the $k$th-order Taylor expansion of $f\left(x\right)$ at $x_{0}$ is
$$f_{k}\left(x\right) = f\left(x_{0}\right) + \sum^{k}_{i=1}\left[f^{(i)}\left(x_{0}\right)/i!\right]\left(x - x_{0}\right)^{i}$$
where $f^{(i)}\left(x_{0}\right)$ is the $i$th-order derivative of $f$ at $x_{0}$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region: for the expansion to converge, the constraint $\left|x - x_{0}\right| < R\left(f, x_{0}\right)$ is required, where $R\left(f, x_{0}\right)$ is the radius of convergence. Second, when the truncation is used as an approximation to the original function, $f_{k}\left(x\right) \approx f\left(x\right)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating a target policy with behavior data. Indeed, to evaluate the truncation $f_{k}\left(x\right)$ at any $x$ (the target policy), we only require the behavior policy "data" at $x_{0}$ (i.e., the derivatives $f^{(i)}\left(x_{0}\right)$).
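As a concrete illustration of both intuitions, here is a minimal numerical sketch in Python (an illustration of the 1D analogy only, not the TayPO algorithm itself). It expands $f\left(x\right) = 1/\left(1 - x\right)$ at $x_{0} = 0$, where the radius of convergence is $R\left(f, 0\right) = 1$, and evaluates the truncation at several target points using only the derivatives collected at $x_{0}$:

```python
import math

# Illustrative 1D example (not the TayPO algorithm): f(x) = 1/(1 - x)
# has i-th derivative f^(i)(0) = i!, so its k-th order truncation at 0
# is f_k(x) = sum_{i=0}^{k} x^i, with radius of convergence R(f, 0) = 1.

def f(x):
    return 1.0 / (1.0 - x)

def taylor_truncation(x, x0, derivs):
    """Evaluate f_k(x) = sum_i [f^(i)(x0) / i!] * (x - x0)^i.

    `derivs[i]` holds the i-th derivative of f at x0 (derivs[0] = f(x0)).
    It plays the role of the behavior "data" collected at x0: once we
    have it, we can evaluate the truncation at any target point x.
    """
    return sum(d / math.factorial(i) * (x - x0) ** i
               for i, d in enumerate(derivs))

k = 8
derivs = [math.factorial(i) for i in range(k + 1)]  # f^(i)(0) = i!

# Inside the trust region |x - 0| < 1 the truncation tracks f;
# at x = 1.2, outside the radius, the approximation breaks down.
for x in (0.3, 0.6, 0.9, 1.2):
    approx = taylor_truncation(x, 0.0, derivs)
    print(f"x={x:4.1f}  f(x)={f(x):8.3f}  f_k(x)={approx:10.3f}")
```

Running this shows the truncation is accurate well inside the radius (e.g., $x = 0.3$), converges slowly near the boundary ($x = 0.9$), and diverges from $f$ outside it ($x = 1.2$), mirroring why a trust-region constraint on how far the target may move from the expansion point is needed.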
Source: Taylor Expansion Policy Optimization
