Exploration Strategies

Generalized State-Dependent Exploration

Introduced by Raffin et al. in Smooth Exploration for Robotic Reinforcement Learning

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that extends State-Dependent Exploration (SDE) by using more general features as input to the noise function and by re-sampling the noise periodically.

State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists of adding noise that is a function of the state $\mathbf{s}_{t}$ to the deterministic action $\mu\left(\mathbf{s}_{t}\right)$. At the beginning of an episode, the parameters $\theta_{\epsilon}$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}_{t}$ is as follows:

$$ \mathbf{a}_{t}=\mu\left(\mathbf{s}_{t} ; \theta_{\mu}\right)+\epsilon\left(\mathbf{s}_{t} ; \theta_{\epsilon}\right), \quad \theta_{\epsilon} \sim \mathcal{N}\left(0, \sigma^{2}\right) $$

This episode-based exploration is smoother and more consistent than unstructured step-based exploration: during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ is always the same.
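A minimal NumPy sketch of this episode-based scheme (the linear stand-in policy, dimensions, and all variable names are illustrative assumptions, not the paper's code): the exploration parameters $\theta_{\epsilon}$ are drawn once at the start of the episode, so repeated visits to the same state yield the same action.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2

# Stand-in deterministic policy mu(s) -- a linear map, for illustration only.
theta_mu = rng.standard_normal((action_dim, state_dim))
mu = lambda s: theta_mu @ s

sigma = 0.5  # exploration scale

# At the START of the episode: draw the exploration parameters once.
theta_eps = rng.normal(0.0, sigma, size=(state_dim, action_dim))

def sde_action(s):
    """SDE action: deterministic part plus state-dependent noise eps(s) = theta_eps^T s."""
    return mu(s) + s @ theta_eps

s = np.array([0.2, -1.0, 0.5])
print(sde_action(s))  # same state ...
print(sde_action(s))  # ... same action for the whole episode (theta_eps is fixed)
```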

In the case of a linear exploration function $\epsilon\left(\mathbf{s} ; \theta_{\epsilon}\right)=\theta_{\epsilon} \mathbf{s}$, Rückstieß et al. show, using standard properties of Gaussian distributions, that each action element $\mathbf{a}_{j}$ is normally distributed:

$$ \pi_{j}\left(\mathbf{a}_{j} \mid \mathbf{s}\right) \sim \mathcal{N}\left(\mu_{j}(\mathbf{s}), \hat{\sigma}_{j}^{2}\right) $$

where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}_{j}=\sqrt{\sum_{i}\left(\sigma_{i j} \mathbf{s}_{i}\right)^{2}}$.
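A quick numerical sanity check of this closed form (a standalone sketch with illustrative values, not taken from the paper): sample many $\theta_{\epsilon}$ matrices, compute $\epsilon(\mathbf{s}) = \theta_{\epsilon}\mathbf{s}$, and compare the empirical standard deviation of each action element against $\hat{\sigma}_{j}$.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim = 3, 2
s = np.array([0.2, -1.0, 0.5])

# Per-element scales sigma_ij of the Gaussian from which theta_eps is drawn.
sigma = np.abs(rng.standard_normal((state_dim, action_dim)))

# Empirical std of eps_j = sum_i theta_ij * s_i over many draws of theta_eps.
theta_eps = rng.normal(0.0, sigma, size=(100_000, state_dim, action_dim))
eps = np.einsum("nij,i->nj", theta_eps, s)
print(eps.std(axis=0))

# Closed-form std: sigma_hat_j = sqrt(sum_i (sigma_ij * s_i)^2)
print(np.sqrt(((sigma * s[:, None]) ** 2).sum(axis=0)))
```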

Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$ :

$$ \frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma_{i j}}=\frac{\left(\mathbf{a}_{j}-\mu_{j}\right)^{2}-\hat{\sigma}_{j}^{2}}{\hat{\sigma}_{j}^{3}} \frac{\mathbf{s}_{i}^{2} \sigma_{i j}}{\hat{\sigma}_{j}} $$

This can easily be plugged into the likelihood-ratio gradient estimator, which allows $\sigma$ to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.
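To make the formula concrete, the sketch below (illustrative values only; in a typical deep-RL implementation this gradient would instead come from automatic differentiation) compares the analytic derivative above with a finite-difference estimate of $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to a single element $\sigma_{ij}$.

```python
import numpy as np

def log_pi(a, mu, sigma, s):
    """Log-density of a_j ~ N(mu_j, sigma_hat_j^2), sigma_hat_j = sqrt(sum_i (sigma_ij s_i)^2)."""
    sigma_hat = np.sqrt(((sigma * s[:, None]) ** 2).sum(axis=0))
    return (-0.5 * ((a - mu) / sigma_hat) ** 2 - np.log(sigma_hat) - 0.5 * np.log(2 * np.pi)).sum()

rng = np.random.default_rng(2)
state_dim, action_dim = 3, 2
s = rng.standard_normal(state_dim)
mu = rng.standard_normal(action_dim)
a = mu + 0.3 * rng.standard_normal(action_dim)
sigma = np.abs(rng.standard_normal((state_dim, action_dim))) + 0.1

# Analytic gradient from the formula above.
sigma_hat = np.sqrt(((sigma * s[:, None]) ** 2).sum(axis=0))
grad = ((a - mu) ** 2 - sigma_hat**2) / sigma_hat**3 * (s[:, None] ** 2 * sigma) / sigma_hat

# Finite-difference check on one element sigma_ij.
i, j, h = 1, 0, 1e-6
sigma_p = sigma.copy(); sigma_p[i, j] += h
print(grad[i, j], (log_pi(a, mu, sigma_p, s) - log_pi(a, mu, sigma, s)) / h)
```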

For gSDE, two improvements are suggested:

  1. We sample the parameters $\theta_{\epsilon}$ of the exploration function every $n$ steps instead of every episode.
  2. Instead of the state $\mathbf{s}$, we can in fact use any features. We choose the policy features $\mathbf{z}_{\mu}\left(\mathbf{s} ; \theta_{\mathbf{z}_{\mu}}\right)$ (the last layer before the deterministic output $\mu(\mathbf{s})=\theta_{\mu} \mathbf{z}_{\mu}\left(\mathbf{s} ; \theta_{\mathbf{z}_{\mu}}\right)$) as input to the noise function $\epsilon\left(\mathbf{s} ; \theta_{\epsilon}\right)=\theta_{\epsilon} \mathbf{z}_{\mu}(\mathbf{s})$; a sketch of both changes follows below.
Source: Smooth Exploration for Robotic Reinforcement Learning
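A minimal NumPy sketch of the two gSDE changes together (the tanh feature layer, dimensions, and the re-sampling interval are illustrative assumptions, not the reference implementation): the exploration matrix $\theta_{\epsilon}$ is redrawn every $n$ steps, and the noise is a linear function of the policy features $\mathbf{z}_{\mu}(\mathbf{s})$ rather than of the raw state.

```python
import numpy as np

class GSDEPolicy:
    """Sketch of gSDE on top of a stand-in policy network (illustrative only)."""

    def __init__(self, state_dim=3, action_dim=2, feat_dim=16, sigma=0.3, n_resample=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.standard_normal((feat_dim, state_dim))       # feature layer
        self.theta_mu = self.rng.standard_normal((action_dim, feat_dim))
        self.sigma = sigma * np.ones((feat_dim, action_dim))           # learned in practice
        self.n_resample = n_resample
        self.step = 0
        self.theta_eps = self.rng.normal(0.0, self.sigma)

    def z_mu(self, s):
        return np.tanh(self.W @ s)            # policy features z_mu(s)

    def act(self, s):
        if self.step % self.n_resample == 0:  # improvement 1: re-sample every n steps
            self.theta_eps = self.rng.normal(0.0, self.sigma)
        self.step += 1
        z = self.z_mu(s)                      # improvement 2: features instead of raw state
        return self.theta_mu @ z + z @ self.theta_eps   # a = mu(s) + theta_eps^T z_mu(s)

policy = GSDEPolicy()
for t in range(3):
    print(policy.act(np.array([0.2, -1.0, 0.5])))
```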
