BRAC+: Going Deeper with Behavior Regularized Offline Reinforcement Learning

1 Jan 2021 · Chi Zhang, Sanmukh Rao Kuppannagari, Viktor Prasanna

Online interaction with the environment to collect data samples for training a Reinforcement Learning agent is not always feasible due to economic and safety concerns. The goal of Offline Reinforcement Learning (RL) is to address this problem by learning effective policies from previously collected datasets. Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions and are hence unsuitable for offline RL. Behavior regularization, which constrains the learned policy to the support set of the dataset, has been proposed to tackle the limitations of standard off-policy algorithms. In this paper, we improve behavior regularized offline reinforcement learning and propose \emph{BRAC+}. We use an analytical upper bound on the KL divergence as the behavior regularizer to reduce the variance associated with sample-based estimation. Additionally, we employ state-dependent Lagrange multipliers for the regularization term to avoid spreading the KL divergence penalty uniformly across all states of the sampled batch. The proposed Lagrange multipliers allow more freedom of deviation at high-probability (more explored) states, leading to better rewards, while restricting low-probability (less explored) states to prevent out-of-distribution actions. We also propose several practical enhancements to further improve performance. On challenging locomotion offline RL benchmarks, BRAC+ matches state-of-the-art approaches on single-modal datasets and outperforms them on multi-modal datasets.
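The two ingredients named in the abstract, a KL-divergence behavior regularizer and a state-dependent Lagrange multiplier that weights it per state, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the names `StateLagrange`, `actor_loss`, the KL budget `epsilon`, and the network sizes are assumptions for illustration, and the analytical KL upper bound is taken as a precomputed input.

```python
import torch
import torch.nn as nn

# Sketch (not the paper's code): behavior-regularized policy improvement
# with a state-dependent Lagrange multiplier alpha(s).

class StateLagrange(nn.Module):
    """Maps a state to a non-negative, state-dependent multiplier alpha(s)."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep alpha(s) >= 0
        )

    def forward(self, state):
        return self.net(state)

def actor_loss(q_value, kl_upper_bound, alpha_s, epsilon=0.1):
    """
    q_value        : Q(s, a) for actions sampled from the learned policy, shape (B, 1)
    kl_upper_bound : analytical upper bound on KL(pi(.|s) || pi_b(.|s)), shape (B, 1)
    alpha_s        : state-dependent multiplier alpha(s), shape (B, 1)
    epsilon        : per-state KL budget (hypothetical hyperparameter)
    """
    # Maximize Q while penalizing divergence from the behavior policy,
    # with the penalty weight chosen per state (Lagrangian relaxation).
    policy_loss = (-q_value + alpha_s.detach() * kl_upper_bound).mean()
    # Dual update on alpha(s): it grows where the KL budget is exceeded
    # (poorly covered states) and shrinks where the policy stays in support.
    alpha_loss = (alpha_s * (epsilon - kl_upper_bound.detach())).mean()
    return policy_loss, alpha_loss
```

Under this sketch, well-covered states end up with small alpha(s) and so can deviate further from the behavior policy in search of higher reward, while sparsely covered states receive a large penalty weight, matching the behavior the abstract describes.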
