Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters.

29 Sep 2021 · Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, Ofir Nachum

To achieve strong performance in offline reinforcement learning (RL), it is necessary to act conservatively with respect to confident lower bounds on the anticipated values of actions. A valuable approach is therefore to obtain high-quality uncertainty estimates on action values. In the current supervised learning literature, state-of-the-art approaches to uncertainty estimation and calibration rely on ensembling methods. In this work, we aim to transfer the success of ensembles from supervised learning to the setting of batch RL. We propose MSG, a model-free, dynamic-programming-based offline RL method that trains an ensemble of independent Q-functions and updates a policy to act conservatively with respect to the uncertainties derived from the ensemble. Theoretically, by referring to the literature on infinite-width neural networks, we demonstrate that the quality of the uncertainty estimates depends crucially on the manner in which ensembling is performed, a phenomenon that arises from the dynamic programming nature of RL and is overlooked by existing offline RL methods. Our theoretical predictions are corroborated by pedagogical examples on toy MDPs, as well as by empirical comparisons in benchmark continuous control domains. In the more challenging domains of the D4RL offline RL benchmark, MSG significantly surpasses well-tuned state-of-the-art methods in batch RL. Motivated by the success of MSG, we investigate whether efficient approximations to ensembles can be as effective. We demonstrate that while efficient variants outperform the current state of the art, they do not match MSG with deep ensembles. We hope our work encourages increased focus on deep network uncertainty estimation techniques designed for reinforcement learning.
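
To make the distinction between independent and coupled ensembling concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a mean-minus-scaled-standard-deviation lower confidence bound with a hypothetical coefficient `beta`, and it uses a shared minimum-over-members target purely as an illustrative contrast; neither detail is specified in the abstract, and all function names are made up for this example.

```python
import numpy as np


def independent_targets(rewards, next_q, discount=0.99):
    """Bellman targets where each member bootstraps from its OWN next-state value.

    rewards: shape (batch,)
    next_q:  shape (ensemble, batch), member i's estimate of Q_i(s', a')
    returns: shape (ensemble, batch), one target row per member
    """
    return rewards[None, :] + discount * next_q


def shared_min_targets(rewards, next_q, discount=0.99):
    """Illustrative contrast (an assumption, not from the abstract):
    every member regresses to one shared target built from the ensemble minimum."""
    shared = rewards + discount * next_q.min(axis=0)
    return np.broadcast_to(shared, next_q.shape)


def lower_confidence_bound(q_values, beta=1.0):
    """Assumed conservative value: ensemble mean minus beta times ensemble std.

    q_values: shape (ensemble, batch); beta is a hypothetical pessimism coefficient.
    """
    return q_values.mean(axis=0) - beta * q_values.std(axis=0)


# Tiny worked example: 2 ensemble members, batch of 2 transitions.
rewards = np.array([1.0, 0.0])
next_q = np.array([[2.0, 4.0],
                   [4.0, 8.0]])

ind = independent_targets(rewards, next_q, discount=0.5)
shr = shared_min_targets(rewards, next_q, discount=0.5)
lcb = lower_confidence_bound(ind, beta=1.0)
```

With independent targets, the rows of `ind` differ, so disagreement between members survives the Bellman backup and can feed the lower confidence bound. With the shared target, every row of `shr` is identical, so the targets themselves carry no ensemble spread.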
