Target Policy Smoothing

Introduced by Fujimoto et al. in Addressing Function Approximation Error in Actor-Critic Methods

Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, which makes them highly susceptible to function approximation error and increases the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the action chosen by the target policy and averages over mini-batches, approximating a SARSA-like expectation over nearby actions.
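The effect of averaging over noisy actions can be illustrated with a toy example. The sketch below is not from the paper: `q_estimate` is a hypothetical one-dimensional value estimate with a broad genuine optimum and a narrow spurious peak, and the noise scale `sigma = 0.2` is an arbitrary illustrative choice. It compares the raw estimate with a Monte Carlo average over perturbed actions.

```python
import numpy as np

def q_estimate(action):
    """Hypothetical 1-D value estimate (an assumption for illustration)."""
    # Broad, genuine optimum centred at action = 0.
    broad = np.exp(-0.5 * (action / 0.5) ** 2)
    # Narrow spurious peak at action = 0.8, standing in for approximation error.
    spike = 2.0 * np.exp(-0.5 * ((action - 0.8) / 0.01) ** 2)
    return broad + spike

rng = np.random.default_rng(0)
sigma = 0.2  # illustrative noise scale, treated here as a standard deviation
noise = rng.normal(0.0, sigma, size=10_000)

def smoothed_q(action):
    """Average the value estimate over noisy copies of the action."""
    return q_estimate(action + noise).mean()

raw_peak, raw_broad = q_estimate(0.8), q_estimate(0.0)
smooth_peak, smooth_broad = smoothed_q(0.8), smoothed_q(0.0)

print(f"raw:      spike={raw_peak:.3f}  broad={raw_broad:.3f}")
print(f"smoothed: spike={smooth_peak:.3f}  broad={smooth_broad:.3f}")
```

In this toy setting the raw estimate prefers the narrow spike, while the smoothed estimate prefers the broad optimum, which is the behaviour the regularization is meant to encourage.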

The modified target update is:

$$ y = r + \gamma{Q}_{\theta'}\left(s', \pi_{\theta'}\left(s'\right) + \epsilon \right) $$

$$ \epsilon \sim \text{clip}\left(\mathcal{N}\left(0, \sigma\right), -c, c \right) $$

where the added noise $\epsilon$ is clipped to the interval $[-c, c]$ to keep the target action close to the original action. The outcome is an algorithm reminiscent of Expected SARSA, except that the value estimate is learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The learned value estimate is therefore with respect to a noisy policy defined by the parameter $\sigma$.
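A minimal NumPy sketch of the target computation follows. The function name `smoothed_target`, the toy `target_actor` and `target_critic`, and the default values for `gamma`, `sigma`, and `noise_clip` are assumptions made for illustration; $\sigma$ is treated as a standard deviation, and details such as terminal-state masking or clipping the perturbed action to the valid action range are omitted.

```python
import numpy as np

def smoothed_target(rewards, next_states, target_actor, target_critic, rng,
                    gamma=0.99, sigma=0.2, noise_clip=0.5):
    """Compute y = r + gamma * Q'(s', pi'(s') + eps) with clipped Gaussian eps."""
    # Action proposed by the target policy for each next state.
    actions = target_actor(next_states)
    # eps ~ clip(N(0, sigma), -c, c), sampled independently per action entry.
    noise = np.clip(rng.normal(0.0, sigma, size=actions.shape),
                    -noise_clip, noise_clip)
    # Evaluate the target critic at the perturbed action.
    return rewards + gamma * target_critic(next_states, actions + noise)

# Hypothetical stand-ins for the target networks (not from the paper).
def target_actor(states):
    weights = np.array([[0.5], [-0.3]])
    return np.tanh(states @ weights)          # shape (batch, 1)

def target_critic(states, actions):
    return -(actions[:, 0] - 0.1) ** 2 + states.sum(axis=1)  # shape (batch,)

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0, -1.0])
next_states = np.array([[0.1, 0.2], [0.0, -0.4], [1.0, 1.0]])

targets = smoothed_target(rewards, next_states, target_actor, target_critic, rng)
print(targets)
```

Because fresh noise is drawn for every transition, averaging the resulting regression loss over a mini-batch approximates the expectation over actions near $\pi_{\theta'}(s')$. Setting `sigma=0.0` recovers the unsmoothed deterministic target.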

Source: Addressing Function Approximation Error in Actor-Critic Methods




| Task | Papers | Share |
|---|---|---|
| Continuous Control | 19 | 39.58% |
| OpenAI Gym | 6 | 12.50% |
| Autonomous Driving | 4 | 8.33% |
| Decision Making | 4 | 8.33% |
| Meta-Learning | 3 | 6.25% |
| Atari Games | 2 | 4.17% |
| energy management | 1 | 2.08% |
| Imitation Learning | 1 | 2.08% |
| Feature Engineering | 1 | 2.08% |

