Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features for the noise function and re-samples the noise periodically.
State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists of adding noise that is a function of the state $\mathbf{s}_{t}$ to the deterministic action $\mu\left(\mathbf{s}_{t}\right)$. At the beginning of an episode, the parameters $\theta_{\epsilon}$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}_{t}$ is as follows:
$$ \mathbf{a}_{t}=\mu\left(\mathbf{s}_{t} ; \theta_{\mu}\right)+\epsilon\left(\mathbf{s}_{t} ; \theta_{\epsilon}\right), \quad \theta_{\epsilon} \sim \mathcal{N}\left(0, \sigma^{2}\right) $$
This episode-based exploration is smoother and more consistent than unstructured step-based exploration: during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ will always be the same.
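As a concrete illustration, here is a minimal NumPy sketch of one SDE episode with a linear noise function. The environment interface, `mu`, and all names are assumptions made for the example, not code from the original method.

```python
import numpy as np

def sde_episode(env, mu, sigma, max_steps=200):
    """Run one episode with State-Dependent Exploration (illustrative sketch).

    mu:    deterministic policy, maps a state vector to an action vector.
    sigma: array of shape (state_dim, action_dim) with per-weight standard deviations.
    """
    # Draw the exploration-function parameters once, at the start of the episode.
    theta_eps = np.random.normal(loc=0.0, scale=sigma)  # theta_eps ~ N(0, sigma^2)

    s = env.reset()
    for _ in range(max_steps):
        # Deterministic action plus state-dependent noise: the same state always
        # yields the same action within this episode, since theta_eps is fixed.
        a = mu(s) + s @ theta_eps
        s, reward, done, _ = env.step(a)
        if done:
            break
```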
In the case of a linear exploration function $\epsilon\left(\mathbf{s} ; \theta_{\epsilon}\right)=\theta_{\epsilon} \mathbf{s}$, by the properties of Gaussian distributions, Rückstieß et al. show that the action element $\mathbf{a}_{j}$ is normally distributed:
$$ \pi_{j}\left(\mathbf{a}_{j} \mid \mathbf{s}\right) \sim \mathcal{N}\left(\mu_{j}(\mathbf{s}), \hat{\sigma}_{j}^{2}\right) $$
where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}_{j}=\sqrt{\sum_{i}\left(\sigma_{i j} \mathbf{s}_{i}\right)^{2}}$.
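A small numerical check of this formula (illustrative, with made-up dimensions and values): the closed-form $\hat{\sigma}_{j}$ should match the empirical spread of $\theta_{\epsilon}^{\top} \mathbf{s}$ over many draws of $\theta_{\epsilon}$.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2
sigma = rng.uniform(0.1, 0.5, size=(state_dim, action_dim))  # per-weight std sigma_ij
s = rng.normal(size=state_dim)                               # a fixed state

# Closed form: sigma_hat_j = sqrt(sum_i (sigma_ij * s_i)^2)
sigma_hat = np.sqrt(((sigma * s[:, None]) ** 2).sum(axis=0))

# Empirical check: sample many theta_eps ~ N(0, sigma^2) and measure the spread of theta_eps^T s
samples = np.array([rng.normal(0.0, sigma).T @ s for _ in range(100_000)])
print(sigma_hat)            # analytic std per action dimension
print(samples.std(axis=0))  # should approximately match
```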
Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$ :
$$ \frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma_{i j}}=\frac{\left(\mathbf{a}_{j}-\mu_{j}\right)^{2}-\hat{\sigma}_{j}^{2}}{\hat{\sigma}_{j}^{3}} \frac{\mathbf{s}_{i}^{2} \sigma_{i j}}{\hat{\sigma}_{j}} $$
This can easily be plugged into the likelihood-ratio gradient estimator, which makes it possible to adapt $\sigma$ during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.
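In practice, this gradient can also be obtained by automatic differentiation through the Gaussian log-density of the action. Below is a minimal PyTorch sketch under that assumption; the variable names are illustrative, not from any particular implementation.

```python
import torch

torch.manual_seed(0)
state_dim, action_dim = 3, 2
log_sigma = torch.zeros(state_dim, action_dim, requires_grad=True)  # learn sigma via its log
s = torch.randn(state_dim)
mu = torch.randn(action_dim)   # stand-in for mu(s; theta_mu)
a = torch.randn(action_dim)    # an action taken during a rollout

sigma = log_sigma.exp()
# sigma_hat_j = sqrt(sum_i (sigma_ij * s_i)^2): std of each action element under SDE
sigma_hat = torch.sqrt(((sigma * s.unsqueeze(-1)) ** 2).sum(dim=0))

# Gaussian log-likelihood of the action; backprop yields d log pi / d sigma,
# matching the analytic expression above.
log_prob = torch.distributions.Normal(mu, sigma_hat).log_prob(a).sum()
log_prob.backward()

# In a policy-gradient method the loss would weight log_prob by an advantage,
# so sigma is adapted alongside the other policy parameters.
print(log_sigma.grad)
```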
For gSDE, two improvements are suggested: the exploration parameters $\theta_{\epsilon}$ are re-sampled every $n$ steps instead of once per episode, and the noise function takes more general features as input (for instance, the latent features of the policy network) rather than the raw state. A sketch combining both ideas is shown below.
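The following hypothetical PyTorch sketch illustrates a gSDE-style policy with periodic noise re-sampling and latent features as noise input; the class, method, and parameter names are assumptions made for illustration, not the interface of any particular library.

```python
import torch
import torch.nn as nn

class GSDEPolicy(nn.Module):
    """Illustrative gSDE-style policy (names are assumptions, not the paper's API)."""

    def __init__(self, state_dim, action_dim, latent_dim=64, sample_freq=8):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, latent_dim), nn.Tanh())
        self.mu_head = nn.Linear(latent_dim, action_dim)
        self.log_sigma = nn.Parameter(torch.zeros(latent_dim, action_dim))
        self.sample_freq = sample_freq  # re-sample theta_eps every n steps
        self.steps = 0
        self.theta_eps = None

    def resample_noise(self):
        # theta_eps ~ N(0, sigma^2): one matrix mapping latent features to action noise
        self.theta_eps = (torch.randn_like(self.log_sigma) * self.log_sigma.exp()).detach()

    def act(self, state):
        if self.theta_eps is None or self.steps % self.sample_freq == 0:
            self.resample_noise()
        self.steps += 1
        z = self.features(state)        # latent features replace the raw state
        return self.mu_head(z) + z @ self.theta_eps
```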