SVPG Explained | Papers With Code

**Stein Variational Policy Gradient**, or **SVPG**, is a policy-gradient method in reinforcement learning that uses Stein Variational Gradient Descent (SVGD) to explore and exploit multiple policies simultaneously. Unlike traditional policy optimization, which attempts to learn a single policy, SVPG models a distribution over policy parameters, such that samples from this distribution represent strong policies. SVPG optimizes this distribution with (relative) [entropy regularization](https://paperswithcode.com/method/entropy-regularization). The objective maximizes the expected utility of policies drawn from the distribution, while the (relative) entropy term explicitly encourages exploration in parameter space. The optimization itself is carried out with SVGD, which uses efficient deterministic dynamics to transport a set of particles so that they approximate a given target posterior distribution.
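
As a sketch of the entropy-regularized objective described above, write $q$ for the distribution over policy parameters, $q_0$ for a prior, $J(\theta)$ for the expected return of the policy with parameters $\theta$, and $\alpha$ for a temperature. The symbols follow the update given further down; the $\mathrm{KL}$ notation for relative entropy is an assumption of this sketch.

$$ \max_{q} \; \mathbb{E}_{q(\theta)}\left[J(\theta)\right] - \alpha \, \mathrm{KL}\left(q \,\|\, q_0\right) $$

The maximizer of this objective has the form

$$ q(\theta) \propto \exp\left(\frac{1}{\alpha} J(\theta)\right) q_0(\theta), $$

so that $\nabla_{\theta} \log q(\theta) = \nabla_{\theta} \left(\frac{1}{\alpha} J(\theta) + \log q_0(\theta)\right)$, which is the quantity that multiplies the kernel in the update.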

The update takes the form:

$$ \Delta\theta_i = \frac{1}{n}\sum_{j=1}^n \left[ \nabla_{\theta_j} \left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right] $$

Here the magnitude of $\alpha$ adjusts the relative importance of the first term, $\nabla_{\theta_j} \left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right)k(\theta_j, \theta_i)$, which combines the policy gradient and the prior, and the repulsive term $\nabla_{\theta_j} k(\theta_j, \theta_i)$. The repulsive term diversifies the particles to enable exploration in parameter space. A suitable $\alpha$ provides a good trade-off between exploitation and exploration. If $\alpha$ is too large, the Stein gradient only drives the particles to be consistent with the prior $q_0$. As $\alpha \to 0$, the algorithm reduces to running $n$ independent copies of a policy gradient algorithm, provided the $\{\theta_i\}$ are initialized very differently. Carefully annealing $\alpha$ allows efficient exploration at the beginning of training and a focus on exploitation toward the end.
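
The update can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the RBF kernel with a fixed bandwidth `h`, the function names `rbf_kernel`, `svpg_direction` and `run_toy_example`, the toy quadratic objective, the standard normal prior, and the geometric annealing schedule for `alpha` are all hypothetical choices made for this example. In practice `grad_J` would come from a policy gradient estimator rather than a closed-form expression.

```python
import numpy as np


def rbf_kernel(theta, h=1.0):
    """RBF kernel matrix and its gradient for particles theta of shape (n, d).

    K[j, i]      = k(theta_j, theta_i) = exp(-||theta_j - theta_i||^2 / h)
    grad_K[j, i] = gradient of k(theta_j, theta_i) with respect to theta_j
    The kernel and the fixed bandwidth h are assumptions of this sketch.
    """
    diff = theta[:, None, :] - theta[None, :, :]   # diff[j, i] = theta_j - theta_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)    # shape (n, n)
    grad_K = (-2.0 / h) * diff * K[:, :, None]     # shape (n, n, d)
    return K, grad_K


def svpg_direction(theta, grad_J, grad_log_prior, alpha, h=1.0):
    """Update direction for every particle, following the formula above.

    theta, grad_J and grad_log_prior all have shape (n, d); row j holds
    theta_j, the gradient of J at theta_j, and the gradient of log q_0 at theta_j.
    """
    n = theta.shape[0]
    K, grad_K = rbf_kernel(theta, h)
    score = grad_J / alpha + grad_log_prior   # gradient of (J / alpha + log q_0)
    drive = K.T @ score                       # drive[i] = sum_j k(theta_j, theta_i) * score[j]
    repulse = grad_K.sum(axis=0)              # repulse[i] = sum_j grad_{theta_j} k(theta_j, theta_i)
    return (drive + repulse) / n


def run_toy_example(n=8, d=2, steps=200, step_size=0.05, seed=0):
    """Toy run with J(theta) = -||theta - 1||^2 and a standard normal prior."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n, d))
    for t in range(steps):
        # Hypothetical geometric annealing of alpha from 10 down to 0.1.
        alpha = 10.0 * 0.01 ** (t / (steps - 1))
        grad_J = -2.0 * (theta - 1.0)         # closed-form gradient of the toy objective
        grad_log_prior = -theta               # gradient of the log standard normal density
        theta = theta + step_size * svpg_direction(theta, grad_J, grad_log_prior, alpha)
    return theta
```

With both gradients set to zero, only the repulsive term remains and neighbouring particles are pushed apart. With a single particle, the kernel gradient vanishes and the direction reduces to the gradient of $\frac{1}{\alpha} J + \log q_0$, so a larger `alpha` shrinks the contribution of the policy gradient relative to the prior.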

Attached collections:

POLICY GRADIENT METHODS

| Task | Papers | Share |
| --- | --- | --- |
| Reinforcement Learning (RL) | 2 | 66.67% |
| Continuous Control | 1 | 33.33% |
