no code implementations • 7 Aug 2023 • Pengfei Wu, Chen Chen, Dexiang Lai, Jian Zhong
Instead of integrating the constraint violation penalty with the reward function, its actor gradients are estimated by a Lagrange advantage function which is derived from two critic systems based on economic reward and violation cost.