POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This arrangement provides an effective approach for the widespread application of RL in high-risk scenarios where both cumulative and state-wise costs need to be considered simultaneously. However previously constrained offline RL algorithms are primarily designed to handle single-constraint problems related to cumulative cost which faces challenges when addressing multi-constraint tasks that involve both cumulative and state-wise costs. In this work we propose a novel Primal policy Optimization with Conservative Estimation algorithm (POCE) to address the problem of multi-constraint offline RL. Concretely we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). Subsequently we present a primal policy optimization algorithm to confront the multi-constraint problems which improves the stability and convergence speed of model training. Furthermore we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally extensive experiments demonstrate that the POCE algorithm achieves competitive performance across multiple experimental tasks particularly outperforming baseline algorithms in terms of safety. Our code is available at \href https://github.com/guanjiayi/poce github.POCE .
PDF Abstract