1 Jan 2024 • Honghao Wei, Xiyue Peng, Xin Liu, Arnob Ghosh
Theoretically, we demonstrate that when the actor employs a no-regret optimization oracle, SATAC achieves two guarantees: (i) for the first time in the offline RL setting, we establish that SATAC can produce a policy that outperforms the behavior policy while maintaining the same level of safety, a property critical to designing algorithms for offline RL.